Method and apparatus for enhancing processing speed for performing a least mean square operation by parallel processing

ABSTRACT

The present invention is related to a method and apparatus for enhancing processing speed for performing a least mean square operation by parallel processing. The input data are subdivided into a plurality of groups and processed by a plurality of adaptive filters in parallel. A Jaber product device is utilized for rearranging the processing results. A subtractor subtracts the output from the Jaber product device from a desired result to generate an error signal. A feedback network adjusts the adaptive filters in accordance with the error signal.

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 60/561,956, filed Apr. 13, 2004, which is incorporated by reference as if fully set forth.

FIELD OF INVENTION

The present invention is related to a method and apparatus for enhancing processing speed for performing a least-mean square operation by parallel processing.

BACKGROUND

Enormous computational load is one of the problems in modern signal processing applications. In order to solve the computational load problem, hardware organization techniques for exploiting parallelism have been developed. Parallel computing is performed by utilizing multiple concurrent processes in the fulfillment of a common task. These processes may be executed on different processors of a multi-processor computer in parallel. Massive parallel processors may even employ thousands of processors which are connected by very fast interconnection networks.

The least-mean square (LMS) algorithm is a linear adaptive filtering algorithm that consists of two basic processes: 1) a filtering process, which involves computing the output of a transversal filter produced by a set of tap inputs and generating an estimation error by comparing this output to a desired response; and 2) an adaptive process, which involves an automatic adjustment of the tap weights of the filter in accordance with the estimation error. FIG. 3 shows an adaptive filter in which the LMS algorithm has been implemented on parallel processors. However, this introduces a problem in that the rest of the operations must be done sequentially.

SUMMARY

The present invention is related to a method and apparatus for enhancing processing speed for performing a least mean square operation by parallel processing. The input data are subdivided into a plurality of groups and processed by a plurality of adaptive filters in parallel. A Jaber product device is utilized for rearranging the processing results. A subtractor subtracts the output from the Jaber product device from a desired result to generate an error signal. A feedback network adjusts the adaptive filters in accordance with the error signal.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an adaptive transversal filter.

FIG. 2 is a detailed block diagram of the transversal filter of FIG. 1.

FIG. 3 is a detailed block diagram of an adaptive weight control mechanism.

FIG. 4 is a diagram of linear regression.

FIG. 5 is a diagram of a linear regression result.

FIG. 6 is a block diagram of a linear regression processing element.

FIG. 7 shows fitting a regression plane to a set of samples in two-dimensional space.

FIG. 8 is a diagram of a regression system for multiple inputs.

FIG. 9 shows a regression result with deviations.

FIG. 10 is a block diagram of a regression system and a plot diagram of inputs and a regression result.

FIG. 11 is a diagram of a performance surface for 2 dimensions and its contour plot.

FIG. 12 is a diagram of the performance surface.

FIG. 13 is a diagram of the performance surface and its gradient.

FIG. 14 is a diagram of contour plots of the performance surface with two weights.

FIG. 15 shows a procedure for searching the minimum using the gradient information.

FIG. 16 is a diagram of a learning curve.

FIG. 17 is a diagram of weight tracks and plots of the weight values across iteration for 3 values of η.

FIGS. 18(a) and 18(b) are diagrams of a weight track toward the minimum; equal eigenvalues and unequal eigenvalues cases, respectively.

FIG. 19 shows rattling of the iteration procedure.

FIG. 20 shows directions of the steepest descent and Newton's method.

FIG. 21 is a diagram of a biased Jaber parallel processing element (BJPPE) in accordance with the present invention.

FIG. 22 is a diagram of an unbiased Jaber parallel processing element (UJPPE) in accordance with the present invention.

FIG. 23 is a diagram of a biased Jaber parallel linear regressor (BJPLR) in accordance with the present invention.

FIG. 24 is a diagram of an unbiased Jaber parallel linear regressor (UJPLR) in accordance with the present invention.

FIG. 25 is a diagram of an optimized structure of the biased Jaber parallel linear regressor in accordance with the present invention.

FIG. 26 is a diagram of an optimized structure of the unbiased Jaber parallel linear regressor in accordance with the present invention.

FIG. 27 is a diagram of a biased Jaber partial linear regressor in accordance with the present invention.

FIG. 28 is a diagram of an unbiased Jaber partial linear regressor in accordance with the present invention.

FIG. 29 is a diagram of an unbiased Jaber parallel regressor system in accordance with the present invention.

FIG. 30 is a diagram of a simplified representation of the unbiased Jaber parallel regressor system in accordance with the present invention.

FIG. 31 is a diagram of a biased Jaber parallel regressor system in accordance with the present invention.

FIG. 32 is a diagram of an optimized structure of the unbiased Jaber parallel regressor system in accordance with the present invention.

FIG. 33 is a diagram of an optimized structure of the biased Jaber parallel regressor system in accordance with the present invention.

FIG. 34 is a diagram of an unbiased Jaber partial regressor system in accordance with the present invention.

FIG. 35 is a diagram of a biased Jaber partial regressor system in accordance with the present invention.

FIG. 36 is a diagram of a predictive filter implementing an r parallel LMS algorithm in accordance with the present invention.

FIGS. 37-39 show simulation results of the filter of FIG. 36 in accordance with the present invention.

FIG. 40 is a diagram of a parallel regression processor in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a block diagram of an adaptive transversal filter 10 including a transversal filter 12, a summer 14 and an adaptive weight control mechanism 16. The combination of these two processes constitutes a feedback loop around the LMS algorithm, implemented within the transversal filter 12.

FIG. 2 is a detailed block diagram of the transversal filter 12 of FIG. 1. The tap inputs x(n), x(n−1), . . . , x(n−M+1) form the elements of the M-by-1 tap input vector x(n), (where M−1 is the number of delay elements), and span a multidimensional space X(n). The value computed for the tap-weight vector w(n) using the LMS algorithm represents an estimate whose expected value approaches the Wiener solution w₀ (for a wide-sense stationary environment) as the number of iterations n approaches infinity.

During the filtering process, the desired response d(n) is supplied for processing alongside the tap-input vector x(n). Given this input, the transversal filter 12 produces an output y(n) used as an estimate of the desired response d(n). Accordingly, an estimation error e(n) is defined as the difference between the desired response and the actual filter output, as indicated at the output end of FIG. 2. The estimation error e(n) and the tap-input vector x(n) are applied to the adaptive weight control mechanism 16.

FIG. 3 is a diagram of the detailed structure of the adaptive weight control mechanism 16. A scaled version of the inner product of the estimation error e(n) and the tap input x(n−k) is computed for k=0, 1, . . . , M−2, M−1. The result so obtained defines the correction δw_(k)(n) applied to the tap weight w_(k)(n) at iteration n+1. The scaling factor used in this computation is denoted by μ, which is called a step-size parameter.

Adaptive systems are those that change their parameters (through algorithms) in order to meet a pre-specified function, which is either an input-output map or an internal constraint. There are basically two ways to adapt the system parameters: 1) supervised adaptation, when there is a desired response that can be used by the system to guide the adaptation; and 2) unsupervised learning, when the system parameters are adapted using only the information of the input and are constrained by pre-specified internal rules.

Regression models are used to predict one variable from one or more other variables. Regression models provide a powerful tool, allowing predictions about past, present, or future events to be made with information about past or present events. Linear regression analyzes the relationship between two variables, x and d. The slope and/or intercept have a scientific meaning, and the linear regression line is used as a standard curve to find new values of x from d, or d from x. The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts d from x. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line. FIG. 4 is a diagram of a linear regression result. The slope quantifies the steepness of the line. It equals the change in d for each unit change in x. It is expressed in the units of the d-axis divided by the units of the x-axis. If the slope is positive, d increases as x increases. If the slope is negative, d decreases as x increases. The d intercept is the d value of the line when x equals zero. It basically defines the elevation of the line.

FIG. 5 is a plot of x versus d. The relationship between the two variables x and d is complex. However, as shown in FIG. 5, when x increases d also increases, (i.e. there is a general trend in the data). The deviation from the straight line could be produced by noise or measurement error, and underlying the apparent complexity could be a very simple (possibly linear) relationship between x and d, i.e.: $\begin{matrix} {d \approx wx + b;} & {{Equation}\quad(1)} \end{matrix}$ or more specifically: $\begin{matrix} {d_{(n)} = wx_{(n)} + b + e_{(n)} = y_{(n)} + e_{(n)};} & {{Equation}\quad(2)} \end{matrix}$ where e_((n)) is the instantaneous error that is added to y_((n)) (the linearly fitted value), w is the slope and b is the intercept (or bias). The problem can be solved by a linear system with only two free parameters, the slope w and the bias b.

FIG. 6 is a diagram of a linear processing element (PE), or an adaptive linear element (ADALINE). It comprises two multipliers and one adder. The multiplier w scales the input, and the multiplier b is a simple bias, which can also be thought of as an extra input connected to the value +1. The parameters (b, w) have different functions in the solution.

The general purpose of multiple regression is to learn the relationship between several independent or predictor variables and a dependent or criterion variable. Assume that d is a function of several inputs x₁, x₂, . . . , x_(p) (independent variables), and the goal is to find the best linear regressor of d on all the inputs. For p=2 this corresponds to fitting a plane through the N input samples, or a hyper-plane in the general case of p dimensions. FIG. 7 shows the fitting of a regression plane to a set of samples in 2D space.

In the general case, the regression equation is expressed as: $\begin{matrix} {{e_{(i)} = {{d_{(i)} - \left( {b + {\sum\limits_{k = 1}^{p}\quad{w_{(k)}x_{({i,k})}}}} \right)} = {d_{(i)} - {\sum\limits_{k = 0}^{p}\quad{w_{(k)}x_{({i,k})}}}}}};} & {{Equation}\quad(3)} \end{matrix}$ where w₀=b and x_((i, 0))=1. The goal of the regression is to find the coefficient vector W that minimizes the mean square error (MSE) of e(i) over the N samples. FIG. 8 shows that the linear PE has p inputs and one bias.

The method of least squares assumes that the best-fit curve of a given type is the curve that has the minimal sum of the deviations squared (least square error) from a given set of data. Suppose that the data points are (x₁, y₁), (x₂, y₂), . . . , (x_(n), y_(n)), where x is the independent variable and y is the dependent variable. The fitting curve d has the deviation (error) σ from each data point, i.e., σ₁=d₁−y₁, σ₂=d₂−y₂, . . . , σ_(n)=d_(n)−y_(n). According to the method of least squares, the best fitting curve has the property that: $\begin{matrix} {J = {{\sigma_{1}^{2} + \sigma_{2}^{2} + \ldots + \sigma_{n}^{2}} = {{\sum\limits_{i = 1}^{n}\quad\sigma_{i}^{2}} = {{\sum\limits_{i = 1}^{n}\quad\left\lbrack {d_{(i)} - y_{(i)}} \right\rbrack^{2}} = {a\quad{minimum}}}}}} & {{Equation}\quad(4)} \end{matrix}$

Least squares solves the problem by finding the line for which the sum of the squared deviations (or residuals) in the d direction is minimized. FIG. 9 is a diagram of a regression line showing the deviations.

The goal is to find a systematic procedure for determining the constants b and w that minimize the error between the true value d_((n)) and the estimated value y_((n)); this procedure is called linear regression: d _((n))−(b+wx _((n)))=d _((n)) −y _((n)) =e _((n)).   Equation (5)

In order to pick the line which best fits the data, a criterion is required to determine which linear estimator is the “best”. The average of the squared errors (also called the mean square error (MSE)) is a widely utilized performance criterion: $\begin{matrix} {{J = {\frac{1}{2N}{\sum\limits_{n = 1}^{N}\quad e_{(n)}^{2}}}};} & {{Equation}\quad(6)} \end{matrix}$ where N is the number of observations and J is the mean square error.

The goal is to minimize J analytically, which can be achieved by taking the partial derivatives of this quantity with respect to the unknowns and equating the resulting expressions to zero, as follows: $\begin{matrix} \left\{ {\begin{matrix} {\frac{\partial J}{\partial b} = 0} \\ {\frac{\partial J}{\partial w} = 0} \end{matrix};} \right. & {{Equation}\quad(7)} \end{matrix}$ which yields, after some manipulation: $\begin{matrix} {{b = \frac{{\sum\limits_{n}x_{(n)}^{2}\sum\limits_{n}d_{(n)}} - {\sum\limits_{n}x_{(n)}\sum\limits_{n}x_{(n)}d_{(n)}}}{N\sum\limits_{n}\left( {x_{(n)} - \overset{\_}{x}} \right)^{2}}};\quad{w = \frac{\sum\limits_{n}\left( {x_{(n)} - \overset{\_}{x}} \right)\left( {d_{(n)} - \overset{\_}{d}} \right)}{\sum\limits_{n}\left( {x_{(n)} - \overset{\_}{x}} \right)^{2}}};} & {{Equation}\quad(8)} \end{matrix}$ where the bar over a variable denotes its mean value. The procedure to determine the coefficients of the line is called the least square method.
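By way of a non-limiting illustration, the closed-form solution of Equation (8) may be sketched in Python as follows. The function name fit_line and the sample data are assumptions introduced for this example only and do not appear in the original disclosure.

```python
def fit_line(xs, ds):
    """Return (w, b) of the least squares line d = w*x + b, per Equation (8).

    fit_line is a hypothetical helper name used only for this sketch.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    d_mean = sum(ds) / n
    # Sum of squared deviations of x from its mean (shared denominator).
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    # Slope: sum of cross deviations divided by sxx.
    w = sum((x - x_mean) * (d - d_mean) for x, d in zip(xs, ds)) / sxx
    # Intercept, written term by term as in Equation (8).
    sum_x = sum(xs)
    sum_d = sum(ds)
    sum_xx = sum(x * x for x in xs)
    sum_xd = sum(x * d for x, d in zip(xs, ds))
    b = (sum_xx * sum_d - sum_x * sum_xd) / (n * sxx)
    return w, b


if __name__ == "__main__":
    # Assumed noise-free data lying on the line d = 2x + 1.
    w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
    print(w, b)
```

For noise-free data lying on a straight line, the sketch recovers the slope and intercept of that line up to floating-point rounding.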

Least square method for a multiple variable case is explained. The MSE in multiple variable case becomes: $\begin{matrix} {J = {\frac{1}{2N}{\sum\limits_{n}\quad{\left( {d_{(n)} - {\sum\limits_{k = 0}^{p}\quad{w_{(k)}x_{({n,k})}}}} \right)^{2}.}}}} & {{Equation}\quad(9)} \end{matrix}$

The solution to the extreme (minimum) of this equation can be found exactly in the same way as before, i.e. by taking the derivatives of J with respect to the unknowns (w_((k))), and equating the result to zero. This solution is the normal matrix equation: $\begin{matrix} {{{\sum\limits_{n}\quad{x_{({n,j})}d_{(n)}}} = {\sum\limits_{k = 0}^{p}\quad{w_{(k)}{\sum\limits_{n}\quad{x_{({n,k})}x_{({n,j})}}}}}};} & {{Equation}\quad(10)} \end{matrix}$ for j=0, 1, . . . , p.

The size of the MSE (J) can be used to determine which line best fits the data. However, it does not necessarily reflect whether a line fits the data at all. The size of the MSE is dependent upon the number of data samples and the magnitude (or power) of the data samples. For instance, by simply scaling the data, the MSE can be changed without changing how well the data is fit by the regression line.

The correlation coefficient (c) solves this problem by comparing the variance of the predicted value with the variance of the desired value. The value c² represents the amount of variance in the data captured by the linear regression: $\begin{matrix} {c^{2} = {\frac{\sum\limits_{n}\quad\left( {y_{(n)} - \overset{\_}{d}} \right)^{2}}{\sum\limits_{n}\quad\left( {d_{(n)} - \overset{\_}{d}} \right)^{2}}.}} & {{Equation}\quad(11)} \end{matrix}$

By substituting y_((n)) by the equation of the regression line, a correlation coefficient is obtained as follows: $\begin{matrix} {c = {\frac{\frac{\sum\limits_{n}\quad{\left( {x_{(n)} - \overset{\_}{x}} \right)\left( {d_{(n)} - \overset{\_}{d}} \right)}}{N}}{\sqrt{\frac{\sum\limits_{n}\quad\left( {d_{(n)} - \overset{\_}{d}} \right)^{2}}{N}}\sqrt{\frac{\sum\limits_{n}\quad\left( {x_{(n)} - \overset{\_}{x}} \right)^{2}}{N}}}.}} & {{Equation}\quad(12)} \end{matrix}$

The numerator is the covariance of the two variables, and the denominator is the product of the corresponding standard deviations. The correlation coefficient is confined to the range [−1, 1]. When c=1 there is a perfect positive correlation between x and d, (i.e. they co-vary). When c=−1, there is a perfect negative correlation between x and d, (i.e. they vary in opposite ways). When c=0 there is no correlation between x and d. Intermediate values describe partial correlations.
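As an illustrative sketch only, the correlation coefficient of Equation (12) may be computed as shown below. The function name correlation_coefficient and the sample data are assumptions made for this example.

```python
import math


def correlation_coefficient(xs, ds):
    """Return c of Equation (12): covariance over the product of standard deviations.

    correlation_coefficient is a hypothetical name used only for this sketch.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    d_mean = sum(ds) / n
    # Numerator of Equation (12): covariance of x and d.
    cov = sum((x - x_mean) * (d - d_mean) for x, d in zip(xs, ds)) / n
    # Denominator of Equation (12): product of the two standard deviations.
    std_x = math.sqrt(sum((x - x_mean) ** 2 for x in xs) / n)
    std_d = math.sqrt(sum((d - d_mean) ** 2 for d in ds) / n)
    if std_x == 0 or std_d == 0:
        raise ValueError("a constant variable has no defined correlation")
    return cov / (std_x * std_d)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    # d rises with x on a line, so c is close to +1.
    print(correlation_coefficient(xs, [1.0, 3.0, 5.0, 7.0]))
    # d falls with x on a line, so c is close to -1.
    print(correlation_coefficient(xs, [7.0, 5.0, 3.0, 1.0]))
```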

The method of least squares is very powerful, since it has no bias and minimizes the variance. It can be generalized to higher order polynomial curves such as quadratics, cubics, etc. Regression can also be extended to multiple variables, where d depends not on a single variable x, but on a vector X=[x₁, . . . , x_(p)]^(T), where T means a transpose and vectors are denoted by capital letters. In this case the regression line becomes a hyper-plane in the space x₁, x₂, . . . , x_(p).

The autocorrelation of the input samples for indices k, j is defined as follows: $\begin{matrix} {R_{({k,j})} = {\frac{1}{N}{\sum\limits_{n}\quad{x_{({n,k})}{x_{({n,j})}.}}}}} & {{Equation}\quad(13)} \end{matrix}$ Knowing that the autocorrelation measures similarity across the input samples, when k=j, R is just the sum of the squares of the input samples (the power in the data). When k differs from j, R measures the sum of the cross-products for every possible combination of the indices. Thus, one obtains information about the structure of the data set. The cross-correlation of the input x for index j and the desired response d is defined as follows: $\begin{matrix} {{P_{(j)} = {\frac{1}{N}{\sum\limits_{n}\quad{x_{({n,j})}d_{(n)}}}}};} & {{Equation}\quad(14)} \end{matrix}$ which can be also put into a vector P of dimension p+1. Substituting these definitions in Equation (10), the set of normal equations can be written as: P=RW* or W*=R ⁻¹ P;   Equation (15) where W is a vector with the p+1 weights w_(i) and W* represents the value of the vector for the optimum (minimum) solution. The solution of the multiple regression problem can be computed analytically as the product of the inverse of the autocorrelation matrix of the input samples and the cross-correlation vector of the input and the desired response.
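The following minimal sketch, offered as an assumption-laden example rather than as the disclosed implementation, builds R and P per Equations (13) and (14) and solves the normal equations of Equation (15). The helper names solve_linear and normal_equation_weights, the use of Gaussian elimination in place of an explicit matrix inverse, and the sample data are all choices made for this illustration.

```python
def solve_linear(a, rhs):
    """Solve a * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("R is singular; the normal equations have no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def normal_equation_weights(samples, ds):
    """Return W* = R^-1 P, with w_0 acting as the bias (x_(n,0) = 1)."""
    rows = [[1.0] + list(s) for s in samples]  # prepend the constant input
    n = len(rows)
    dim = len(rows[0])
    # Equation (13): autocorrelation matrix of the inputs.
    R = [[sum(r[k] * r[j] for r in rows) / n for j in range(dim)] for k in range(dim)]
    # Equation (14): cross-correlation of inputs and desired response.
    P = [sum(r[j] * d for r, d in zip(rows, ds)) / n for j in range(dim)]
    return solve_linear(R, P)


if __name__ == "__main__":
    # Assumed noise-free data generated by d = 1 + 2*x1 - 3*x2.
    samples = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
    ds = [1.0, 3.0, -2.0, 0.0, 2.0]
    print(normal_equation_weights(samples, ds))
```

Solving the linear system directly avoids forming R⁻¹ explicitly; this is a design choice of the sketch and yields the same W* whenever R is invertible.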

Since the regression coefficient for x₁ is larger than x₂, the variable x₁ (speed) affects the quality more than variable x₂ (feed rate). The multiple correlation coefficient c_(m) can also be defined in the multiple dimensional case for a single output, as follows: $\begin{matrix} {{c_{(m)} = \sqrt{\frac{{W^{*}U_{(x)}D} - {N{\overset{\_}{d}}^{2}}}{{D^{T}D} - {N{\overset{\_}{d}}^{2}}}}};} & {{Equation}\quad(16)} \end{matrix}$ which measures the amount of variation explained by the linear regression, normalized by the variance of D. In this expression, D is the vector built from the desired responses d_(i), and U is a matrix containing the input data vectors.

The purpose of least squares is to find parameters (b, w₁, w₂, . . . , w_(p)) that minimize the difference between the system output y_((n)) and the desired response d_((n)). Regression is a procedure for effectively computing the optimal parameters of an interpolating system which predicts the value of d from the value of x.

FIG. 10 shows the operation of adapting the parameters of the linear system. The system output y is always a linear combination of the input x with the bias, so it has to lie on a straight line of equation y=wx+b. Changing b modifies the y-intercept, while changing w modifies the slope. Linear regression adjusts the position of the line such that the average square difference between the y values (on the line) and the cloud of points d_((n)), i.e. the average of the squared error e_((n)), is minimized.

The key point is to recognize that the error contains information which can be used to optimally place the line. FIG. 10 shows this by including a subsystem which accepts the error and modifies the parameters of the system. Thus, the error e_((n)) is fed back to the system and indirectly affects the output through a change in the parameters (b,w).

The performance surface concept can be extended to p+1 dimensions, making J a paraboloid facing upwards as shown in FIG. 11. The definition of J is in matrix notation but remains quadratic as follows: $\begin{matrix} {J = {\frac{1}{2}\left\lbrack {{W^{T}{RW}} - {2P^{T}W} + {\sum\limits_{n}\quad\frac{d_{(n)}^{2}}{N}}} \right\rbrack.}} & {{Equation}\quad(17)} \end{matrix}$

The values of the coefficients that minimize J are: ∇J=0=RW−P or W*=R ⁻¹ P;   Equation (18) which gives exactly the same solution as Equation (15). In the space (w₁, w₂), J is a paraboloid facing upwards.

The mean square error (J) is analyzed as the parameters of the system w are changed. Without loss of generality, it is assumed that b=0 (or equivalently that the means of x and d are subtracted), such that J becomes a function of the variable w as follows: $\begin{matrix} {J = {{\frac{1}{2N}{\sum\limits_{n}\quad\left( {{d(n)} - {{wx}(n)}} \right)^{2}}} = {\frac{1}{2N}{\sum\quad{\left( {{{x^{2}(n)}w^{2}} - {2{d(n)}{x(n)}w} + {d^{2}(n)}} \right).}}}}} & {{Equation}\quad(19)} \end{matrix}$ If w is treated as the variable and all other parameters are held constant, J is quadratic in w, and the coefficient of w², namely (1/2N)Σx²(n), is always positive. In the space of the possible w values, J is a parabola facing upwards (J is never negative since it is a sum of squares). The function J(w) is called the performance surface for the regression problem, which is the total error surface plotted in the space of the system coefficients (weights) as shown in FIG. 12. The performance surface is an important tool which visualizes how the adaptation of the weights affects the MSE.

Using the performance surface, a geometric method can be developed for finding the value of w, w*, which minimizes the performance criterion. w* is computed by setting the derivative of J with respect to w equal to zero. The gradient of the performance surface is a vector (with the dimension of w) which always points towards the direction of maximum change and with a magnitude equal to the slope of the tangent of the performance surface as shown in FIG. 13. If the performance surface is visualized as a hillside, each point on the hill has a gradient arrow which points in the steepest uphill direction at that point with larger magnitudes for steeper slopes. Thus, a ball rolling down the hill always attempts to roll in the direction opposite to the gradient arrow, and ends up at the bottom.

In this case, the gradient has two components, one along the weight axis w and another along the J axis, ∇J=∇_(w)J+∇_(J)J. The component of the gradient of interest is the one along the unknown parameter direction, ∇_(w)J, and only this component will be referred to hereinafter. A graphical way to construct this component of the gradient of J at a point w₀ is to first find the contour curve (curve of constant value) at the point. The gradient is always perpendicular to the contour curve at w₀, with a magnitude given by the partial derivative of J with respect to the weight w: $\begin{matrix} {{\nabla_{w}J} = {\frac{\partial J}{\partial w}.}} & {{Equation}\quad(20)} \end{matrix}$

At the bottom of the bowl, the gradient is zero, because the parabola has slope 0 at the vertex. For a quadratic performance surface, by computing the gradient and equating it to zero, the value of the coefficients that minimize the cost can be found, such as in Equation (8). The analytical solution found by the least squares coincides with the minimum of the performance surface. Substituting the value of w* into Equation (19), the minimum value of the error (J_(min)) can be computed.

For a quadratic performance surface, the value of the coefficients that minimize the cost is as follows: $\begin{matrix} {{{{\nabla J} = {\frac{\partial J}{\partial w} = {0 = {\frac{1}{N}\left( {{- {\sum\limits_{n}\quad{d_{(n)}x_{(n)}}}} + {w{\sum\limits_{n}\quad x_{(n)}^{2}}}} \right)}}}};}{{or};}} & {{Equation}\quad(21)} \\ {w^{*} = {\frac{\sum\limits_{n}\quad{x_{(n)}d_{(n)}}}{\sum\limits_{n}\quad x_{(n)}^{2}}.}} & {{Equation}\quad(22)} \end{matrix}$

This solution is fundamentally the same as found in Equation (8) (b=0 is equivalent to assuming that the average value of x and d are zero). The analytical solution found by the least squares coincides with the minimum of the performance surface. Substituting this value of w* into Equation (19), the minimum value of the error becomes as follows: $\begin{matrix} {J_{(\min)} = {{\frac{1}{2N}\left\lbrack {{\sum\limits_{n}\quad d_{(n)}^{2}} - \frac{\left( {\sum\limits_{n}\quad{d_{(n)}x_{(n)}}} \right)^{2}}{\sum\limits_{n}\quad x_{(n)}^{2}}} \right\rbrack}.}} & {{Equation}\quad(23)} \end{matrix}$

Equation (19) can be re-written as follows: $\begin{matrix} {J = {J_{(\min)} + {\frac{1}{2N}\left( {w - w^{*}} \right){\sum\limits_{n}\quad{{x_{(n)}^{2}\left( {w - w^{*}} \right)}.}}}}} & {{Equation}\quad(24)} \end{matrix}$
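As a hedged numerical check, and not as part of the original disclosure, the sketch below evaluates J directly from Equation (19) and again from the minimum-centered form of Equation (24), using w* from Equation (22) and J_min from Equation (23). The function names and the sample data are assumptions made for this example.

```python
def mse(w, xs, ds):
    """Equation (19): J(w) with the bias assumed to be zero."""
    n = len(xs)
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds)) / (2 * n)


def optimum(xs, ds):
    """Return (w*, J_min) per Equations (22) and (23)."""
    n = len(xs)
    sum_xx = sum(x * x for x in xs)
    sum_xd = sum(x * d for x, d in zip(xs, ds))
    sum_dd = sum(d * d for d in ds)
    w_star = sum_xd / sum_xx
    j_min = (sum_dd - sum_xd ** 2 / sum_xx) / (2 * n)
    return w_star, j_min


def mse_from_minimum(w, xs, ds):
    """Equation (24): J written around its minimum."""
    n = len(xs)
    w_star, j_min = optimum(xs, ds)
    sum_xx = sum(x * x for x in xs)
    return j_min + (w - w_star) * sum_xx * (w - w_star) / (2 * n)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]   # assumed sample inputs
    ds = [2.0, 3.0, 7.0]   # assumed desired responses
    for w in (-1.0, 0.0, 2.5):
        # The two columns agree up to floating-point rounding.
        print(w, mse(w, xs, ds), mse_from_minimum(w, xs, ds))
```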

Summarizing, the input data controls both the shape of the performance surface as well as the location and value of the minimum. However, changing the desired response only modifies the location of w* and the value of the minimum error.

In the multiple variable case, the minimum value of the error can be obtained by substituting Equation (18) into Equation (17), yielding: $\begin{matrix} {J_{(\min)} = {{\frac{1}{2}\left\lbrack {{\sum\limits_{n}\quad\frac{d_{(n)}^{2}}{N}} - {P^{T}W^{*}}} \right\rbrack}.}} & {{Equation}\quad(25)} \end{matrix}$

The performance surface can be rewritten in terms of its minimum value and W* as follows: $\begin{matrix} {J = {J_{(\min)} + {\frac{1}{2}\left( {W - W^{*}} \right)^{T}{{R\left( {W - W^{*}} \right)}.}}}} & {{Equation}\quad(26)} \end{matrix}$

For the one dimensional case, this equation is the same as Equation (24) (R becomes a scalar equal to the power of the input). In the space (w₁, w₂), J is now a paraboloid facing upwards. The shape of J is solely dependent upon the input data (through its autocorrelation function). The principal axes of the performance surface contours (surfaces of equal error) correspond to the eigenvectors of the input correlation matrix R, while the eigenvalues of R give the rate of change of the gradient along the principal axes of the surface contours of J as shown in FIG. 14.

The eigenvectors and eigenvalues of the input autocorrelation matrix are all that matter for understanding the convergence of gradient descent in multiple dimensions. The eigenvectors represent the natural (orthogonal) coordinate system to study the properties of R. In fact, in this coordinate system, the convergence of the algorithm can be studied as a joint adaptation of several (one for each dimension of the space) uni-dimensional algorithms. Along each eigenvector direction the algorithm behaves just like the one variable case. The eigenvalue becomes the projection of the data onto that direction just like λ in Equation (25) is the projection of the data on the weight direction.

The location of the performance surface in weight space depends upon both the input and desired response. The minimum error is also dependent upon both data. Multiple regressions find the location of the minimum of a paraboloid placed in an unknown position in the weight space. The input distribution defines the shape of the performance surface. The input distribution and its relation with the desired response distribution define both the value of the minimum of the error and the location in coefficient space where that minimum occurs. The autocorrelation of the input (R) thus completely specifies the shape of the performance surface. However, the location and final error depend also on the desired response.

Search of the performance surface with steepest descent is explained. The method of steepest descent (also known as the gradient method) is the simplest example of a gradient-based method for minimizing a function of several variables.

Since the performance surface is a paraboloid which has a single minimum, an alternate procedure to finding the best value of the coefficient w is to search the performance surface instead of computing the best coefficient analytically by Equation (8). The search for the minimum of a function can be done efficiently using a broad class of methods that use gradient information. The gradient has two main advantages for search: the gradient can be computed locally; and the gradient always points in the direction of maximum change.

If the goal is to reach the minimum, the search must be in the direction opposite to the gradient. So, the overall method of search can be stated in the following way:

The search is started with an arbitrary initial weight w₍₀₎, where the iteration is denoted by the index in parenthesis. Then, the gradient of the performance surface is computed at w₍₀₎, and the initial weight is modified proportionally to the negative of the gradient at w₍₀₎. This changes the operating point to w₍₁₎. Then the gradient is computed at the new position w₍₁₎, and the same procedure is applied again. The process is expressed as follows: w _((k+1)) =w _((k)) −η∇J _((k));   Equation (27) where η is a small constant and ∇J denotes the gradient of the performance surface at the k^(th) iteration. η is used to maintain stability in the search by ensuring that the operating point does not move too far along the performance surface. This search procedure is called the steepest descent method. FIG. 15 illustrates the search procedure.

If the path of the weights is traced from iteration to iteration and the constant η is small, eventually the best value for the coefficient w* will be found. Whenever w>w*, w is decreased, and whenever w<w*, w is increased.
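The search procedure for a single weight may be illustrated by the following minimal sketch, which uses the exact gradient of Equation (21) inside the update of Equation (27). The function name steepest_descent, the value of η, the number of steps and the sample data are assumptions chosen for this example.

```python
def steepest_descent(xs, ds, eta, steps, w0=0.0):
    """Return the list of weights w_(0), w_(1), ... visited by Equation (27)."""
    n = len(xs)
    # Quantities that define the one-dimensional performance surface.
    power = sum(x * x for x in xs) / n                # (1/N) * sum of x^2
    cross = sum(x * d for x, d in zip(xs, ds)) / n    # (1/N) * sum of x*d
    w = w0
    track = [w]
    for _ in range(steps):
        grad = w * power - cross   # Equation (21): gradient of J at w
        w = w - eta * grad         # Equation (27): step against the gradient
        track.append(w)
    return track


if __name__ == "__main__":
    # Assumed noise-free data with d = 2x, so that w* = 2.
    track = steepest_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], eta=0.1, steps=50)
    print(track[0], track[1], track[-1])
```

With the assumed data and step size, each iteration moves w from the initial value toward w*, which is consistent with the behavior described above for a small η.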

In the multiple variable case, gradient techniques can also be used to find the minimum of the performance surface, but the gradient is a vector with p+1 components as follows: $\begin{matrix} {{\nabla J} = {\left\lbrack {\frac{\partial J}{\partial w_{0}},\ldots\quad,\frac{\partial J}{\partial w_{p}}} \right\rbrack^{T}.}} & {{Equation}\quad(28)} \end{matrix}$

The extension of Equation (27) to the case in which all quantities are vectors, i.e. $W_{(k)} = \left\lbrack w_{0(k)},\ldots,w_{p(k)} \right\rbrack^{T}$, is: $\begin{matrix} {W_{(k + 1)} = W_{(k)} - \eta\nabla J_{(k)}.} & {{Equation}\quad(29)} \end{matrix}$
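A minimal sketch of the vector update of Equation (29) is given below, assuming the exact gradient ∇J=RW−P of Equation (18) is available. The helper names correlations and steepest_descent_vector, the value of η, the number of steps and the sample data are assumptions introduced for illustration only.

```python
def correlations(samples, ds):
    """Build R (Equation (13)) and P (Equation (14)) with x_(n,0) = 1 for the bias."""
    rows = [[1.0] + list(s) for s in samples]
    n = len(rows)
    dim = len(rows[0])
    R = [[sum(r[k] * r[j] for r in rows) / n for j in range(dim)] for k in range(dim)]
    P = [sum(r[j] * d for r, d in zip(rows, ds)) / n for j in range(dim)]
    return R, P


def steepest_descent_vector(R, P, eta, steps):
    """Iterate W_(k+1) = W_(k) - eta * (R W_(k) - P), starting from zero weights."""
    dim = len(P)
    W = [0.0] * dim
    for _ in range(steps):
        # Equation (18): gradient of the quadratic performance surface.
        grad = [sum(R[j][k] * W[k] for k in range(dim)) - P[j] for j in range(dim)]
        # Equation (29): move every weight against its gradient component.
        W = [wj - eta * gj for wj, gj in zip(W, grad)]
    return W


if __name__ == "__main__":
    # Assumed noise-free data generated by d = 1 + 2*x1 - 3*x2.
    samples = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
    ds = [1.0, 3.0, -2.0, 0.0, 2.0]
    R, P = correlations(samples, ds)
    print(steepest_descent_vector(R, P, eta=0.5, steps=2000))
```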

An adaptive system uses the gradient to optimize its parameters. The gradient, however, is usually not known analytically, and thus must be estimated. Traditionally, gradient estimation is done by estimating the derivative using the difference operator. A good estimate, however, requires many small perturbations to the operating point to obtain a robust estimation through averaging. The method is straightforward but not very practical.

In the prior art, use of the instantaneous value of the gradient as the estimator for the true quantity has been disclosed. This means that the summation in Equation (19) is dropped, and the gradient estimate at step k is defined as its instantaneous value. Substituting Equation (6) into Equation (20), removing the summation, and then taking the derivative with respect to w yields: $\begin{matrix} {{\nabla J_{(k)}} = {\frac{\partial}{\partial w}J_{(k)}} = {\frac{\partial}{\partial w}\frac{1}{2N}\sum\limits_{n}e_{(n)}^{2}} \approx {\frac{1}{2}\frac{\partial}{\partial w}e_{(k)}^{2}} = {- e_{(k)}x_{(k)}.}} & {{Equation}\quad(30)} \end{matrix}$

In accordance with Equation (30), an instantaneous estimate of the gradient is simply the product of the input to the weight times the error at iteration k. This means that with one multiplication per weight, the gradient can be estimated. This is the gradient estimate that leads to the famous Least Mean Square (LMS) algorithm. However, the estimate is noisy, since the algorithm uses the error from a single sample instead of summing the error for each point in the data set (i.e. the MSE is estimated by the error for the current sample). The adaptation process does not find the minimum in one step. Many iterations are normally required to find the minimum of the performance surface, and during this process the noise in the gradient is averaged or filtered out.

If the estimator of Equation (30) is substituted in Equation (29), the steepest descent equation becomes: w _((k+1)) =w _((k)) +ηe _((k)) x _((k)).   Equation (31)

This equation is the LMS algorithm. With the LMS rule, one does not need to worry about perturbation and averaging to properly estimate the gradient during iteration. It is the iterative process that improves the gradient estimator. The small constant η is called the step size or the learning rate.

The LMS algorithm for multiple weights is explained. It is straightforward to extend the gradient estimation given by the LMS algorithm from one dimension to many dimensions because the estimation is local. The instantaneous gradient estimate in Equation (30) is applied to each element of Equation (29). The LMS for multiple dimensions reads (W and X are vectors) as follows: W _((k+1)) =W _((k)) +ηe _((k)) X _((k)).   Equation (32)

The LMS adaptation rule uses local computations, therefore it can be rewritten for each weight n as follows: w _((n)) _((k+1)) =w _((n)) _((k)) +ηe _((k)) x _((n)) _((k)) .   Equation (33)
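The per-weight rule of Equation (33) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the figures: the function name `lms`, the synthetic noiseless data, the step size and the number of passes are all hypothetical choices.

```python
import random


def lms(inputs, desired, eta, epochs=1):
    """Sample-by-sample LMS: w_n(k+1) = w_n(k) + eta * e(k) * x_n(k) for every weight n."""
    weights = [0.0] * len(inputs[0])
    for _ in range(epochs):
        for x, d in zip(inputs, desired):
            y = sum(w_n * x_n for w_n, x_n in zip(weights, x))  # filter output
            e = d - y                                           # estimation error
            # Each weight is updated from its own input and the shared error only.
            weights = [w_n + eta * e * x_n for w_n, x_n in zip(weights, x)]
    return weights


# Assumed test data: a noiseless linear relationship with two weights.
random.seed(0)
true_w = [2.0, -1.0]
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(500)]
D = [sum(w * x for w, x in zip(true_w, row)) for row in X]

W = lms(X, D, eta=0.1, epochs=5)
print(W)  # approaches true_w
```

Each weight update needs one multiplication by the shared product of step size and error, which is what makes the rule local.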

Although the analysis of the gradient descent techniques is complex, the LMS algorithm itself is still very simple. This is one reason why the LMS is so widely used. However, since the LMS is a steepest descent algorithm, the analysis and discussions concerning the largest step size for convergence and coupling of modes also apply to the LMS algorithm.

One of the interesting aspects of the LMS solution is its robustness. No matter what is the initial condition for the weights, the solution always converges to basically the same values. Even some noise can be added to the desired response and the linear regression parameters are basically unchanged. This robustness is important for real world applications, where noise is omnipresent.

The correlation coefficient, c, indicates how much of the variance of d is captured by a linear regression on the independent variable x. As such, c is a very powerful quantifier of the result of the modeling. It has a great advantage with respect to the MSE because it is automatically normalized, while the MSE is not. However, the correlation coefficient is blind to differences in means because it is a ratio of variances (see Equation 11). Therefore, one effectively needs both quantities (c and MSE) when testing the results of regression.

Equation (11) presents a simple way of computing the correlation coefficient requiring only knowledge of y and d. However, y changes during adaptation. So, one should wait until the system adapts to read the final correlation coefficient. During adaptation, the numerator of Equation (11) can be larger than the denominator giving a value for c larger than 1, which is meaningless.

A parameter g is computed during the iterations that will converge to the correlation coefficient when the ADALINE adapts. A term is subtracted from the numerator of Equation (11) that becomes zero at the optimal setting but limits g such that its value is always between −1 and 1 even during adaptation. The parameter g is written as follows: $\begin{matrix} {g = {\frac{\sqrt{\sum\limits_{n}\left( {y_{(n)} - \overset{\_}{d}} \right)^{2}} - \frac{\sum\limits_{n}{e_{(n)}\left( {y_{(n)} - \overset{\_}{d}} \right)}}{\sqrt{\sum\limits_{n}\left( {y_{(n)} - \overset{\_}{d}} \right)^{2}}}}{\sqrt{\sum\limits_{n}\left( {d_{(n)} - \overset{\_}{d}} \right)^{2}}}.}} & {{Equation}\quad(34)} \end{matrix}$

However, Equation (34) measures only the correlation coefficient when the ADALINE has been totally adapted to the data. The correlation coefficient can be approximated for the multiple regression case by Equation (34).

During adaptation, the learning algorithm automatically changes the system parameters of Equation (31). The adaptation algorithm has parameters (e.g. the step size) that must be selected. As shown in FIG. 15, when the weights approach the optimum value, the values of $J(w_{(k)})$ (the MSE at iteration k) will also decrease, approaching the minimum value J_((min)). One of the best ways to monitor the convergence of the adaptation process is to plot the error value at each iteration. The plot of the MSE across iterations is called the learning curve as shown in FIG. 16. The learning curve is an important tool for adaptive systems. It is an external, scalar, and easy to compute indication of how well the system is learning. However, it is unspecific (i.e. when the system is not learning, it does not indicate the reason).

The rate at which the error decreases depends on the value of the step size η. Larger step sizes take fewer iterations to reach the neighborhood of the minimum, provided the adaptation converges.

An adaptive system modifies its weights in an effort to find the best solution. The plot of the value of a weight over time is called the weight tracks. Weight tracks are an important and direct measure of the adaptation process. The problem is that normally the system in accordance with the present invention has many weights and does not know what their optimal values are. Nevertheless the dynamics of learning can be inferred and monitored from the weight tracks.

Weight tracks for the one-dimensional case are explained. In the gradient descent adaptation, adjustments to the weights are governed by two quantities: the step size η and the value of the gradient at the point. Even for a constant step size, the weight adjustments become smaller and smaller as the adaptation approaches w*, since the slope of the quadratic performance surface decreases near the bottom of the performance surface. Thus, the weights approach their final values asymptotically as shown in FIG. 17.

Three cases are depicted in FIG. 17. If the step size is small, the weights converge monotonically to w*, and the number of iterations to reach the bottom of the bowl is large. If the step size η is increased, the convergence will be faster but still monotonic. Beyond a value called the critically damped step size, the weights will approach w* in an oscillatory fashion (η₂>η₁), i.e. they will overshoot and undershoot the final solution. The number of iterations necessary to reach the neighborhood of w* will increase again. If the step size is too large (η₃>η₂), the iterative process will diverge, i.e. instead of getting closer to the minimum, the search will visit points of larger and larger MSE, until there is a numeric overflow.

The set of points visited by the weights during the adaptation is called the weight track. The weight tracks are perpendicular to the contour curves at each point. It is important to provide a graphical construction for the gradient at each point of the contour plot, because once the gradient direction is known, the next weight moves in the direction opposite to the gradient.

Given a point in a contour, the tangent of the contour is taken at the point. The gradient is perpendicular to the tangent, so the weights move along the gradient line and point in the opposite direction. When the eigenvalues are the same, the contour plots are circular and the gradient always points to the center, (i.e. to the minimum). In this case the gradient descent only has a single time constant as in the one dimensional case.

In general, the eigenvalues are different. When the eigenvalues are different, the weight track bends because it follows the direction of the gradient at each point, which is perpendicular to the contours as shown in FIG. 15. So the gradient direction does not point to the minimum, which means that the weight tracks are not straight lines to the minimum. The adaptation takes longer because a longer path to the minimum is taken and the step-size must be decreased compared to the circular case.

Preferably, the largest step size possible is chosen for fastest convergence without creating an unstable system. Since adjustment to the weights is a product of the step size and the local gradient of the performance surface, the largest step size depends also on the shape of the performance surface.

One Dimensional (uni-variable) algorithm.

The shape of the performance surface is controlled by the input data, (see Equation (19)), which is expressed as follows: $\begin{matrix} {J = {J_{\min} + {\frac{1}{2N}\left( {w - w^{*}} \right){\sum\limits_{n}{{x_{(n)}^{2}\left( {w - w^{*}} \right)}.}}}}} & {{Equation}\quad(35)} \end{matrix}$ The maximum step size is dictated by the input data. If the equations are rewritten so that they produce the weight values in terms of the first weight w₍₀₎, the following equation is obtained: w _((k+1)) =w*+(1−ηλ)^(k+1)(w ₍₀₎ −w*);   Equation (36) where: $\begin{matrix} {\lambda = {\frac{1}{N}{\sum\limits_{n}{x_{(n)}^{2}.}}}} & {{Equation}\quad(37)} \end{matrix}$

Since the term (1−ηλ) is raised to a power, its magnitude must be less than one to maintain stability of the weights and to guarantee that the power converges to 0 (giving w_((k+1))=w*). This implies that: $\begin{matrix} {{|\rho|} = {|{1 - {\eta\lambda}}|} < 1 \Rightarrow {\eta < \frac{2}{\lambda}},} & {{Equation}\quad(38)} \end{matrix}$ where ρ is the geometric ratio of the iterative process. Hence, the value of the step size η must always be smaller than 2/λ. The fastest convergence is obtained with the critically damped step size of 1/λ. The closer η is to 1/λ, the faster the convergence. However, a step size above 1/λ also means that the iterative process is closer to instability. In FIG. 13, when η is increased, a monotonic (over-damped) convergence to w* is replaced by an alternating (under-damped) convergence that finally degenerates into divergence.
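The three regimes implied by the geometric ratio ρ=1−ηλ can be checked numerically. The sketch below is illustrative only; it evaluates the weight after k updates in closed form, and the values chosen for `lam`, `w0` and `w_star` are assumptions.

```python
def weight_track(w0, w_star, eta, lam, iterations):
    """Weight after k updates: w* + (1 - eta*lam)**k * (w(0) - w*), for k = 0..iterations."""
    rho = 1.0 - eta * lam  # geometric ratio of the iterative process
    return [w_star + rho ** k * (w0 - w_star) for k in range(iterations + 1)]


lam, w0, w_star = 0.5, 0.0, 1.0

over_damped = weight_track(w0, w_star, eta=0.5 / lam, lam=lam, iterations=20)   # rho = 0.5
under_damped = weight_track(w0, w_star, eta=1.5 / lam, lam=lam, iterations=20)  # rho = -0.5
divergent = weight_track(w0, w_star, eta=2.5 / lam, lam=lam, iterations=20)     # rho = -1.5

for name, track in [("over", over_damped), ("under", under_damped), ("diverge", divergent)]:
    print(name, abs(track[-1] - w_star))
```

With η below 1/λ the error keeps its sign, between 1/λ and 2/λ the error alternates in sign while shrinking, and above 2/λ it grows at every iteration.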

However, there is a drawback with this approach. During batch learning the weight updates are added together during an epoch to obtain the new weight. This effectively includes a factor of N in the LMS weight update formula in Equation (31). In order to apply the analysis of the largest step size of Equation (38) one has to use a normalized step size: $\begin{matrix} {\eta_{n} = {\frac{\eta}{N}.}} & {{Equation}\quad(39)} \end{matrix}$ However, since the LMS uses an instantaneous (noisy) estimate of the gradient, even when η obeys Equation (38), instability may occur. When the iterative process diverges, the algorithm “forgets” its location in the performance surface, (i.e. the values of the weights will change drastically). This means that all the iterations up to that point were wasted. Hence, with the LMS it is common to include a safety factor of 10 in the largest η(η=0.1/λ).

Multivariable case.

The condition to guarantee convergence is: $\begin{matrix} {{{\lim\limits_{k\rightarrow\infty}\left( {I - {\eta\Lambda}} \right)^{k}} = 0};} & {{Equation}\quad(40)} \end{matrix}$ where Λ is the eigen-value matrix: $\begin{matrix} {{\Lambda = \begin{bmatrix} \lambda_{0} & \ldots & 0 \\ \ldots & \ldots & \ldots \\ 0 & \ldots & \lambda_{p} \end{bmatrix}},} & {{Equation}\quad(41)} \end{matrix}$ which means that in every principal direction of the performance surface (given by the eigenvectors of the input correlation matrix R) one must have: $\begin{matrix} {{0 < \eta < \frac{2}{\lambda_{n}}},} & {{Equation}\quad(42)} \end{matrix}$ where λ_(n) is the corresponding eigen-value. This equation also means that with a single η each weight w_((n)) _((k)) approaches its optimal value w*_((n)) with a different time constant (“speed”). So the weight tracks bend and the path is no longer a straight line towards the minimum.

This is the mathematical description of the gradient descent algorithm that behaves as many one dimensional uni-variable algorithms along the eigenvector directions. Equation (41) is diagonal, so there is no cross-coupling between time constants along the eigenvector directions.

In any other direction of the space, there will be coupling. However, the overall weight track can be decomposed as a combination of weight tracks along each eigen-direction as shown in FIG. 19. Equation (42) shows that the step-size along each direction obeys the same rule as the uni-dimensional case.

The worst case condition to guarantee convergence to the optimum W* is therefore as follows: $\begin{matrix} {\eta < {\frac{2}{\lambda_{\max}}.}} & {{Equation}\quad(43)} \end{matrix}$ The step size η must be smaller than twice the inverse of the largest eigen-value of the correlation matrix. Otherwise the iteration will diverge in one (or more) directions. Since the adaptation is coupled, divergence in one direction will cause the entire system to diverge. In the early stages of adaptation, the convergence is primarily along the direction of the largest eigen-value.

When the eigen-value spread, (which is the ratio of the largest over the smallest eigen-value), of R is large, there are very different time constants of adaptation in each direction. This reasoning gives a clear picture of the fundamental constraint of adapting the weights using gradient descent with a single step size η. The speed of adaptation is controlled by the smallest eigen-value, while the largest step size is constrained by the inverse of the largest eigen-value. This means that if the eigen-value spread is large, the convergence is intrinsically slow.
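The effect of the eigen-value spread can be illustrated along the eigenvector directions, where each mode decays independently with ratio (1−ηλ_n). The following sketch assumes two hypothetical eigen-values and is not tied to any particular input data; `modes_after` and `max_stable_step` are illustrative names.

```python
def max_stable_step(eigenvalues):
    """Upper bound on the step size: eta < 2 / lambda_max."""
    return 2.0 / max(eigenvalues)


def modes_after(k, eta, eigenvalues, v0):
    """Distance to the optimum along each eigen-direction after k iterations."""
    return [(1.0 - eta * lam) ** k * v for lam, v in zip(eigenvalues, v0)]


eigenvalues = [4.0, 0.25]          # assumed values, eigen-value spread of 16
v0 = [1.0, 1.0]                    # initial distance from W* in each direction

bound = max_stable_step(eigenvalues)
stable = modes_after(100, 0.4, eigenvalues, v0)     # eta below the bound
unstable = modes_after(100, 0.6, eigenvalues, v0)   # eta above the bound

print(bound, stable, unstable)
```

In the stable run the mode of the largest eigen-value vanishes almost immediately, while the mode of the smallest eigen-value is still decaying after 100 iterations, which is the slow convergence described above.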

The learning curve approaches J_(min) in a geometric progression as before. However, there are many different time constants of adaptation, one per each direction. Initially, the learning curve decreases at the rate of the largest eigen-value, but towards the end of adaptation the rate of decrease of J is controlled by the time constant of the smallest eigen-value.

The eigen-value spread can be computed by an eigen-decomposition of R. However, this is a time consuming operation. An estimate of the eigen-value spread is the ratio between the maximum and the minimum of the magnitude of the Fourier transform of the input data.

Alternatively, simple inspection of the correlation matrix of the input can provide an estimation of the time to find a solution. The best possible case is when R is diagonal with equal values in the diagonal, because in this case the eigen-value spread is 1 and the gradient descent goes in a straight line to the minimum. This is the fastest convergence. When R is diagonal but with different values, the ratio of the largest number over the smallest is a good approximation to the eigen-value spread. When R is fully populated, the analysis becomes much more difficult. If the non-diagonal terms have values comparable to the diagonal terms, a long training time is required.

An alternative view of the adaptive process is to quantify the convergence of w_((k)) to w* in terms of an exponential decrease. w_((k)) converges to w* as a geometric progression. The envelope of the geometric progression can be approximated by an exponential decrease exp(−t/τ), where τ is the time constant of adaptation.

One dimensional case.

A single iteration can be linked to a time unit. The time constant of adaptation can be written: $\begin{matrix} {{\tau = \frac{1}{\eta\lambda}};} & {{Equation}\quad(44)} \end{matrix}$

which clearly shows that fast adaptation (small time constant τ) requires large step sizes. Equation (44) is obtained by writing exp(−1/τ)=ρ (one iteration per time unit), expanding the exponential in a Taylor series as in Equation (45), and approximating $\rho \approx {1 - \frac{1}{\tau}}$, whose value is substituted in Equation (38). $\begin{matrix} {\rho = {e^{- \frac{1}{\tau}} = {1 - \frac{1}{\tau} + \frac{1}{{2!}\tau^{2}} - \ldots}}} & {{Equation}\quad(45)} \end{matrix}$

The steps used to derive the time constant of adaptation can be applied to come up with a closed form solution to the decrease of the cost across iterations. Equation (36) indicates how the weights converge to w*. If the equation for the weight recursion is substituted in the equation for the cost where ${\lambda = {\frac{1}{N}{\sum\limits_{n}\quad x_{(n)}^{2}}}},$ Equation (46) is obtained: $\begin{matrix} {{J = {J_{\min} + {{\lambda\left( {1 - {\eta\lambda}} \right)}^{2k}\left( {w_{(0)} - w^{*}} \right)^{2}}}};} & {{Equation}\quad(46)} \end{matrix}$ which means that J also approximates J_(min) in a geometric progression, with a ratio equal to ρ². Therefore the time constant of the learning curve is: $\begin{matrix} {\tau_{mse} = {\frac{\tau}{2}.}} & {{Equation}\quad(47)} \end{matrix}$

Since the geometric ratio is always positive, J approaches J_(min) monotonically (i.e. as an exponential decrease). These expressions assume that the adaptation follows the gradient. With the instantaneous estimate used in the LMS, J may oscillate during adaptation since the estimate is noisy. But even in LMS, J approaches J_(min) in a one-sided way (i.e. it is always greater than or equal to J_(min)).

Multivariable case.

When the adaptation approaches W*, the algorithm converges at the speed of the slowest mode, which has the largest time constant and corresponds to the smallest eigen-value. Therefore, the time constant of adaptation is: $\begin{matrix} {\tau = {\frac{1}{\eta\lambda_{\min}}.}} & {{Equation}\quad(48)} \end{matrix}$

For fast convergence a large step size (η) is needed. When the search is close to the minimum w*, the gradient is small, but not zero, and the iterative process continues to wander around a neighborhood of the minimum solution without ever stabilizing. This phenomenon is called rattling as shown in FIG. 19, in which the basin increases proportionally to the step size η. This means that when the adaptive process is stopped by an external command, (such as the number of passes through the data), the weights may not be exactly at the optimum w*.

In accordance with the performance surface in FIG. 19, when the final weights are not at w*, the performance is not optimum, i.e. the final MSE will be higher than J_(min). The difference between the final MSE and the J_(min) (normalized by J_(min)) is called the miss-adjustment M: $\begin{matrix} {M = {\frac{J_{final} - J_{\min}}{J_{\min}}.}} & {{Equation}\quad(49)} \end{matrix}$

This means that in search procedures that use gradient descent there is an intrinsic compromise between accuracy of the final solution, (small miss-adjustment which is the normalized excess MSE produced by the rattling), and speed of convergence. The parameter that controls this compromise is the step size η. High η means fast convergence but also large miss-adjustment, while small η means slow convergence but little miss-adjustment.

For fast convergence to the neighborhood of the minimum, a large step size is desired. However, the solution with a large step size suffers from rattling. One solution is to use a large learning rate in the beginning of training to move quickly towards the location of the optimum weights but then to decrease the learning rate to obtain good accuracy on the final weight values. This is called learning rate scheduling. This simple idea can be implemented with a variable step size controlled by: η_((n+1))=η_((n))−β,   Equation (50) where η(0)=η0 is the initial step size, and β is a small constant. The step size is linearly decreased each iteration. If one has control of the number of iterations, the process can be started with a large step size which is decreased to practically zero. The value of β needs to be experimentally found. Alternatively, the step size can be annealed geometrically, or logarithmically.

If the initial value of η₀ is set too high, the learning can diverge. The selection of β can be even trickier than the selection of η because it is highly dependent on the performance surface. If β is too large, the weights may not move quickly enough to the minimum and the adaptation may stall. If β is too small, then the search may reach the global minimum quickly and must wait a long time before the learning rate decreases enough to minimize the rattling.
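The linear schedule of Equation (50) can be sketched as follows. Clamping the step size at zero is an assumption added here so that the schedule never turns negative; the name `linear_schedule` and the numeric values are illustrative.

```python
def linear_schedule(eta0, beta, iterations):
    """Step sizes eta(n+1) = eta(n) - beta, starting from eta(0) = eta0.

    Assumption: the value is clamped at 0.0 once the linear decrease is exhausted.
    """
    etas = []
    eta = eta0
    for _ in range(iterations):
        etas.append(eta)
        eta = max(eta - beta, 0.0)
    return etas


etas = linear_schedule(eta0=0.5, beta=0.125, iterations=6)
print(etas)
```

A larger β reaches zero sooner, which corresponds to the stalling risk described above, while a smaller β keeps the step size large for longer and prolongs the rattling.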

Newton's method is known to find the minimum of a quadratic function in one iteration, because the gradient of a quadratic function is linear in the weights. Finding the minimum of the performance surface can be equated to finding the root of the gradient equation of Equation (28). The adaptive weight equation using Newton's method is: W _((k+1)) =W _((k)) −R ⁻¹ ∇J _((k)).   Equation (51)

Comparing with Equation (29), the gradient information is weighted by the inverse of the correlation matrix of the input, and η is equal to one. This means that Newton's method corrects the direction of the search such that it always points to the minimum, while the gradient descent points to the maximum direction of change. These two directions may or may not coincide (see FIG. 20).

They coincide when the contour plots are circles, i.e. when the largest and the smallest eigen-value of the correlation matrix are the same. When the ratio of the largest to the smallest eigen-value increases (the eigen-value spread), the slope of the performance surface in the two directions differs more and more. Accordingly, for large eigen-value spreads, the optimization path taken by gradient descent is normally much longer than the path taken by Newton's method. This implies that Newton's method will be faster than LMS when the input data correlation matrix has a large eigen-value spread.

Another advantage of Newton's method versus the steepest descent is in terms of geometric ratios or time constant of adaptation. When the gradient is multiplied by R⁻¹ not only the direction of the gradient is being changed but the different eigen-values in each direction are being equalized. What this means is that Newton's method is automatically correcting the time constant of adaptation for each direction such that all the weights converge at the same rate. Hence, Newton's method has a single time constant of adaptation, unlike the steepest descent method.

Newton's method uses much more information about the performance surface (the curvature). In fact, to implement Newton's method, the inverse of the correlation matrix should be computed which takes significantly longer than a single multiplication in the LMS method and also requires global information. Newton's method is also brittle; i.e. if the surface is not exactly quadratic, the method may diverge. This is the reason Newton's method is modified as follows to have also a small step size η instead of using one in Equation (51). W _((k+1)) =W _((k)) +ηR ⁻¹ e _((k)) X _((k)).   Equation (52) X_((k)) is a vector and R⁻¹ is a matrix, so the update for one weight influences all the other inputs in the system. This is the reason that the computations are no longer local to each weight.
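The difference between the two search directions can be illustrated on a small quadratic surface. The sketch below is a simplification under stated assumptions: it uses the exact gradient R(W−W*) of a hypothetical two-weight quadratic surface rather than the instantaneous estimate of Equation (52), and the matrix `R`, the optimum `w_star` and the helper names are illustrative.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def gradient(w, w_star, R):
    """Exact gradient of the assumed quadratic surface: R (W - W*)."""
    return mat_vec(R, [w_i - s_i for w_i, s_i in zip(w, w_star)])


def newton_step(w, w_star, R, eta=1.0):
    direction = mat_vec(inverse_2x2(R), gradient(w, w_star, R))
    return [w_i - eta * d_i for w_i, d_i in zip(w, direction)]


def gradient_step(w, w_star, R, eta):
    return [w_i - eta * g_i for w_i, g_i in zip(w, gradient(w, w_star, R))]


R = [[2.0, 1.0], [1.0, 2.0]]   # assumed correlation matrix, eigen-values 3 and 1
w_star = [1.0, -2.0]
w0 = [0.0, 0.0]

print(newton_step(w0, w_star, R))          # lands on w_star in one step
print(gradient_step(w0, w_star, R, 0.5))   # moves along the gradient only
```

The Newton step needs the full inverse of R, so every weight update depends on all inputs, whereas the gradient step uses only local information.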

Alternatively, to improve convergence speed with the LMS, an orthogonalizing transformation of the input correlation function followed by an equalization of the eigenvalues which is called a whitening transformation can be implemented. Since Newton's method coincides with the steepest descent for performance surfaces that are symmetric, this preprocessing makes the LMS perform as Newton.

Multiple regression with bias.

If the bias is added, the computation involves three parameters. The largest step size in the case with bias is different since the input data is effectively changed if one interprets the bias as a weight connected to an extra constant input of one. Hence the autocorrelation function and its eigen-value spread are changed.

The use of a bias is called the full least square solution, and it is the recommended way to apply least squares. When a bias is utilized in the processing element (PE), the regression line is not restricted to pass through the origin of the space, and smaller errors are normally achieved. There are two equivalent ways to solve the full least squares problem for D input variables.

First, input and desired responses need to be modified such that they become zero mean variables. These are called deviation scores, or z scores when the variables have been normalized to zero mean and unit standard deviation. In this case, a D weight regression effectively solves the original problem. The bias b is computed indirectly by: $\begin{matrix} {{b = {\overset{\_}{d} - {\sum\limits_{n = 1}^{D}{w_{(n)}{\overset{\_}{x}}_{(n)}}}}};} & {{Equation}\quad(53)} \end{matrix}$ where w_((n)) are the optimal weights and the bars represent mean values.

In a second alternative, the input matrix has to be extended with an extra column of 1s (the first column). This transforms R into a (D+1)×(D+1) matrix, which introduces a D+1 weight in the solution (the bias).
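The first alternative can be sketched for a single input variable. As an assumption made only for brevity, the weight is obtained with the closed-form least squares solution on the zero-mean data instead of an LMS iteration, and the bias then follows from Equation (53); the function name and the data are illustrative.

```python
def regression_with_bias(x, d):
    """Fit d ~ w*x + b by removing the means, solving for w, then recovering b."""
    n = len(x)
    x_mean = sum(x) / n
    d_mean = sum(d) / n
    x_dev = [x_i - x_mean for x_i in x]   # zero-mean input
    d_dev = [d_i - d_mean for d_i in d]   # zero-mean desired response
    # Closed-form single-weight least squares on the deviations (assumed shortcut).
    w = sum(a * b for a, b in zip(x_dev, d_dev)) / sum(a * a for a in x_dev)
    b = d_mean - w * x_mean               # Equation (53) with one weight
    return w, b


x = [0.0, 1.0, 2.0, 3.0]
d = [1.0, 3.0, 5.0, 7.0]    # generated from d = 2x + 1
w, b = regression_with_bias(x, d)
print(w, b)
```

The second alternative would reach the same line by prepending a constant input of one to every sample and treating the bias as an additional weight.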

The LMS algorithm in practice.

The step size should be normalized by the variance of the input data estimated by the trace of R as follows: $\begin{matrix} {{\eta = \frac{\eta_{0}}{{tr}(R)}};} & {{Equation}\quad(54)} \end{matrix}$ where η₀=0.1 to 0.01. The algorithm is expected to converge (settling time) in a number of iterations k given by four times the time constant of adaptation: $\begin{matrix} {{k \approx {4\quad\tau_{mse}}} = {\frac{2}{{\eta\lambda}_{\min}}.}} & {{Equation}\quad(55)} \end{matrix}$
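The normalization of Equation (54) and the settling time of Equation (55) can be sketched as follows. The trace of R is estimated here as the sum of the mean squared values of the inputs, and the smallest eigen-value is passed in as a known quantity; both the helper names and the sample data are assumptions for illustration.

```python
def trace_of_correlation(inputs):
    """tr(R) estimated as the sum over input variables of their mean squared value."""
    n = len(inputs)
    width = len(inputs[0])
    return sum(sum(row[i] ** 2 for row in inputs) / n for i in range(width))


def normalized_step(eta0, inputs):
    """Equation (54): eta = eta0 / tr(R)."""
    return eta0 / trace_of_correlation(inputs)


def settling_iterations(eta, lam_min):
    """Equation (55): k ~ 4 * tau_mse = 2 / (eta * lambda_min)."""
    return 2.0 / (eta * lam_min)


inputs = [[1.0, 2.0], [3.0, 0.0], [1.0, 2.0], [3.0, 0.0]]
trace = trace_of_correlation(inputs)
eta = normalized_step(0.07, inputs)
print(trace, eta, settling_iterations(eta, lam_min=2.0))
```

Normalizing by the trace avoids an eigen-decomposition, since the trace equals the sum of the eigen-values and therefore bounds the largest one.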

The LMS algorithm has a miss-adjustment that is basically one half the trace of R times η: $\begin{matrix} {M = {\frac{\eta}{2}{{{tr}(R)}.}}} & {{Equation}\quad(56)} \end{matrix}$

When the eigenvalues are equal, the miss-adjustment can be approximated by: $\begin{matrix} {M \approx {\frac{D + 1}{4\quad\tau_{mse}}.}} & {{Equation}\quad(57)} \end{matrix}$ Therefore, the miss-adjustment equals the number of weights divided by the settling time, or equivalently, selecting η so that it produces 10% miss-adjustment means that the training duration in iterations becomes 10 times the number of inputs.

Jaber product ($\hat{*}_{(\alpha,\gamma,\beta)}$).

For a given r×r square matrix T_(r) and for a given column vector x_((n)) of size N (N=r^(n)), the Jaber product is defined and expressed with the operator $\hat{*}_{(\alpha,\gamma,\beta)}$ (Jaber product of radix α performed on γ column vectors of size β) by the following operation, where the γ column vectors are subsets of x_((n)) picked up at a stride α: $\begin{matrix} \begin{matrix} {X_{(k)} = {{\hat{*}}_{({r,r,{N/r}})}\left( {T_{r},\begin{bmatrix} x_{({rn})} \\ x_{({{rn} + 1})} \\ \vdots \\ x_{({{rn} + {({r - 1})}})} \end{bmatrix}} \right)}} \\ {= {T_{r} \times \begin{bmatrix} x_{({rn})} \\ x_{({{rn} + 1})} \\ \vdots \\ x_{({{rn} + {({r - 1})}})} \end{bmatrix}}} \\ {= {\begin{bmatrix} T_{0,0} & T_{0,1} & \ldots & T_{0,{({r - 1})}} \\ T_{1,0} & T_{1,1} & \ldots & T_{1,{({r - 1})}} \\ \vdots & \vdots & \ddots & \vdots \\ T_{{({r - 1})},0} & T_{{({r - 1})},1} & \ldots & T_{{({r - 1})},{({r - 1})}} \end{bmatrix} \times {{col}\left\lbrack x_{({{rn} + j_{0}})} \right\rbrack}}} \\ {= \left\lbrack {\sum\limits_{j_{0} = 0}^{r - 1}{T_{({l,j_{0}})}x_{({{rn} + j_{0}})}}} \right\rbrack} \end{matrix} & {{Equation}\quad(58)} \end{matrix}$ for k=0, 1, . . . , (N/r)−1 and l=0, 1, . . . , r−1. X_((k)) is a column vector or column vectors of length γ×β in which the l^(th) element Y_(l) of the k^(th) product Y_((l,k)) is labeled as: l _((k)) =l×(λ×γ×β)+k;   Equation (59) for k=0, 1, . . . , (λ×β)−1 and j₀=0, 1, . . . , r−1 and where λ is a power of r.

This is viewed as a column vector composed of r column vectors of length (λ×β) where λ is a power of r. Here the l^(th) element Y_(l) of the k^(th) product Y_((l,k)) is indexed as in Equation (61).

The spatial Radix-r signal factorization.

Based on the proposition for the Jaber product, discrete signals can be decomposed into r partial signals when T_(r)=I_(r) (where I_(r) is an identity matrix). In fact, for any given discrete signal x_((n)) it can be written as follows: $\begin{matrix} \begin{matrix} {x_{(k)} = {{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},\begin{bmatrix} x_{({rn})} \\ x_{({{rn} + 1})} \\ . \\ x_{({{rn} + {({r - 1})}})} \end{bmatrix}} \right)}} \\ {{= {\begin{bmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & 1 \end{bmatrix} \times {{col}\left\lbrack x_{({{rn} + j_{0}})} \right\rbrack}}};} \end{matrix} & {{Equation}\quad(60)} \end{matrix}$ which is a column vector or column vectors of length π×N where λ(λ=N/r) is a power of r⁻¹ in which the l^(th) element x_(l) of the k^(th) product x_((l,k)) is labeled as: l _((k)) =l×(N/r)+k.   Equation (61)
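One way to read Equations (60) and (61) is as a stride-r split of the signal followed by a stacking of the partial signals. The sketch below is an interpretation under that assumption, not a definitive statement of the Jaber product; the function names are illustrative.

```python
def radix_r_decompose(x, r):
    """Split x into r partial signals; partial signal j0 holds x[r*n + j0]."""
    if len(x) % r:
        raise ValueError("signal length must be a multiple of r")
    return [x[j0::r] for j0 in range(r)]


def stack_partial_signals(x, r):
    """Store element k of partial signal l at position l * (N / r) + k."""
    stride = len(x) // r
    out = [None] * len(x)
    for l, partial in enumerate(radix_r_decompose(x, r)):
        for k, value in enumerate(partial):
            out[l * stride + k] = value
    return out


signal = list(range(8))
print(radix_r_decompose(signal, 2))
print(stack_partial_signals(signal, 2))
```

Each partial signal can then be handed to its own processing branch, since no branch needs samples that belong to another branch.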

The parallel implementation of the regression method.

The relationship between d and x stated in Equation (1) can be expressed in terms of the Jaber product as follows: $\begin{matrix} \begin{matrix} {{d_{(n)} \approx {{wx}_{(n)} + b}} = Y_{(n)}} \\ {= {{w\left( {{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack x_{({{rn} + j_{0}})} \right\rbrack}} \right)} \right)} + b}} \\ {= {{{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack {wx}_{({{rn} + j_{0}})} \right\rbrack}} \right)} + b}} \\ {{= {{{{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack y_{({{rn} + j_{0}})} \right\rbrack}} \right)} + b} = Y_{(n)}}};} \end{matrix} & {{Equation}\quad(62)} \end{matrix}$ for j₀=0, 1, . . . , r−1.

FIGS. 21 and 22 are diagrams of parallel processing elements 100 a, 100 b; and FIGS. 23 and 24 are diagrams of a parallel linear regressor 200 a, 200 b, in accordance with the present invention. The signal can be decomposed in a proper manner for parallel processing in order to speed up the process.

The parallel processing element 100 a, 100 b comprises a set of multipliers 102 ₀-102 _(r−1), (r multipliers in FIG. 21), and a Jaber product device 104. The processing element 100 a of FIG. 21 further comprises an additional multiplier 106 and an adder 108 for bias. N input data are subdivided and each of the plurality of subdivided data enters the multipliers 102 ₀-102 _(r−1) simultaneously. The multipliers 102 ₀-102 _(r−1) multiply the inputs with corresponding weights and forward the multiplication outputs to the Jaber product device 104. In order to obtain the overall system driven output in a normal order (i.e. for n=0, 1, . . . ), the Jaber product device 104 rearranges the multiplication results and generates a regression output.

The parallel linear regressor 200 a (or 200 b) comprises a parallel processing element 100 a (or 100 b) and an adder 110. The parallel processing element 100 a (or 100 b) generates regression outputs and the outputs are compared with a desired value by the adder 110. The adder 110 subtracts the regression output from the desired value to generate error signals. The parallel linear regressor 200 a, 200 b in accordance with the present invention generates error signals that are in the best fitting of the input data.

The structure of the parallel linear regressor can be optimized by manipulating Equation (2). Equation (2) can be expressed in terms of the Jaber product as follows: $\begin{matrix} \begin{matrix} {e_{(n)} = {{d_{(n)} - y_{(n)}} = {d_{(n)} - \left( {{wx}_{(n)} + b} \right)}}} \\ {= {{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack {d_{({{rn} + j_{0}})} - y_{({{rn} + j_{0}})}} \right\rbrack}} \right)}} \\ {= {{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack {d_{({{rn} + j_{0}})} - \left( {{wx}_{({{rn} + j_{0}})} + b} \right)} \right\rbrack}} \right)}} \\ {= {{\hat{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack e_{({{rn} + j_{0}})} \right\rbrack}} \right)}} \end{matrix} & {{Equation}\quad(63)} \end{matrix}$ for j₀=0, 1, . . . , r−1. FIGS. 25 and 26 are diagrams of the optimized structure of the parallel linear regressor 300 a, 300 b. The parallel linear regressor 300 a, 300 b comprises a plurality of multipliers 102 ₀-102 _(r−1), a plurality of adders 103 ₀-103 _(r−1) and a Jaber product device 104. Each multiplier 102 ₀-102 _(r−1) multiplies its input with a corresponding weight and each adder 103 ₀-103 _(r−1) subtracts the multiplication result from the desired value in order to generate individual error signals. The Jaber product device 104 rearranges the individual error signals from the adders 103 ₀-103 _(r−1) and generates a final error signal for the input data. The parallel linear regressor 300 a further comprises additional multipliers 106 ₀-106 _(r−1) for bias. FIGS. 27 and 28 are diagrams of a partial linear regressor which comprises the parallel linear regressor. Each result obtained from each set of data is called a partial linear regression, computed on a partial linear regressor.

In FIG. 23, the regression is computed on r parallel regressors and the error is then computed by subtracting the overall system output from the desired signal. In FIG. 25, the desired response is also subdivided in the same manner as the input signal, so that each partial error is computed locally and the overall system error is then reconstructed.
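
The equivalence between the two structures can be illustrated with a minimal Python sketch. It assumes that the subdivision assigns sample rn+j₀ to stream j₀ and that the Jaber product device can be modelled as a plain re-interleaving of the r streams; the function names `decimate`, `interleave`, `direct_error` and `parallel_error` are hypothetical and are not taken from the specification.

```python
def decimate(seq, r):
    """Split seq into r interleaved streams; stream j0 holds seq[r*n + j0]."""
    return [seq[j0::r] for j0 in range(r)]


def interleave(streams):
    """Restore natural order (assumed stand-in for the Jaber product device)."""
    r = len(streams)
    out = [None] * sum(len(s) for s in streams)
    for j0, stream in enumerate(streams):
        out[j0::r] = stream
    return out


def direct_error(x, d, w, b):
    """Sequential error e(n) = d(n) - (w*x(n) + b)."""
    return [dn - (w * xn + b) for xn, dn in zip(x, d)]


def parallel_error(x, d, w, b, r):
    """Compute each partial error locally, then rearrange into natural order."""
    xs, ds = decimate(x, r), decimate(d, r)
    partial = [direct_error(xj, dj, w, b) for xj, dj in zip(xs, ds)]
    return interleave(partial)


# Illustrative data only (not from the specification).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
d = [1.0, 2.9, 5.2, 7.1, 8.8, 11.1, 13.0, 15.2]
print(parallel_error(x, d, 2.0, 1.0, 4) == direct_error(x, d, 2.0, 1.0))
```

Each stream is handled independently of the others, so in this sketch the r partial errors could be evaluated on separate processing elements before the rearrangement step.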

Parallel regressor for multiple variables.

The relationship between d and x stated in Equation (3) for the multiple variable case can be expressed in terms of the Jaber product as follows: $\begin{matrix} \begin{matrix} {e_{(n)} = d_{(n)} - \left( b + \sum\limits_{k = 0}^{p}w_{(k)}x_{(n,k)} \right)} \\ {= d_{(n)} - \left( b + \sum\limits_{k = 0}^{p}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack w_{(k)}x_{((rn + j_{0}),k)} \right\rbrack \right) \right)} \\ {= d_{(n)} - \left( b + \sum\limits_{k = 0}^{p}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack y_{((rn + j_{0}),k)} \right\rbrack \right) \right)} \\ {= d_{(n)} - \left( b + {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \sum\limits_{k = 0}^{p}y_{((rn + j_{0}),k)} \right\rbrack \right) \right) = d_{(n)} - Y_{(n)},} \end{matrix} & {{Equation}\quad(64)} \end{matrix}$ for j₀=0, 1, . . . , r−1. FIGS. 29-31 are diagrams of the parallel regressor system for the multiple variable case in accordance with the present invention.

Similarly, expressing Equation (64) in terms of the Jaber product yields: $\begin{matrix} \begin{matrix} {e_{(n)} = d_{(n)} - \left( b + \sum\limits_{k = 0}^{p}w_{(k)}x_{(n,k)} \right)} \\ {= {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack d_{(rn + j_{0})} - \left( b + \sum\limits_{k = 0}^{p}w_{(k)}x_{((rn + j_{0}),k)} \right) \right\rbrack \right)} \\ {= {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack e_{(rn + j_{0})} \right\rbrack \right)} \end{matrix} & {{Equation}\quad(65)} \end{matrix}$ for j₀=0, 1, . . . , r−1. FIGS. 32 and 33 are diagrams of the optimized structure of the parallel regressor system and FIGS. 34 and 35 are diagrams of the partial regressor system in accordance with the present invention.

The parallel implementation of the least squares method.

The method of least squares assumes that the best-fit curve of a given type is the curve that has the minimal sum of the deviations squared (least square error) from a given set of data. Suppose that the N data points are (x₀, y₀), (x₁, y₁), . . . , (x_((N−1)), y_((N−1))), where x is the independent variable and y is the dependent variable. The fitting curve d has the deviation (error) σ from each data point, i.e., σ₀=d₀−y₀, σ₁=d₁−y₁, . . . , σ_((N−1))=d_((N−1))−y_((N−1)), which can be re-ordered as follows: σ₀ =d ₀ −y ₀, σ_(r) =d _(r) −y _(r), σ_(2r) =d _(2r) −y _(2r), . . . , σ_(rn) =d _(rn) −y _(rn), σ₁ =d ₁ −y ₁, . . . , σ_((rn+1)) =d _((rn+1)) −y _((rn+1)), . . . , σ_((rn+(r−1))) =d _((rn+(r−1))) −y _((rn+(r−1)))   Equation (66) for n=0, 1, . . . , (N/r)−1.

According to the method of least squares, the best fitting curve has the property that: $\begin{matrix} \begin{matrix} {J = \sigma_{0}^{2} + \ldots + \sigma_{rn}^{2} + \sigma_{1}^{2} + \ldots + \sigma_{(rn + 1)}^{2} + \ldots + \sigma_{(rn + (r - 1))}^{2}} \\ {= \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left\lbrack d_{(rn + j_{0})} - y_{(rn + j_{0})} \right\rbrack^{2}} \\ {= \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\sigma_{(rn + j_{0})}^{2} = \text{a minimum}} \end{matrix} & {{Equation}\quad(67)} \end{matrix}$

It has been shown that Equation (5) can be expanded in terms of the Jaber product as follows: $\begin{matrix} \begin{matrix} {e_{(n)} = d_{(n)} - \left( b + wx_{(n)} \right) = d_{(n)} - y_{(n)}} \\ {= {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack d_{(rn + j_{0})} - \left( b + wx_{(rn + j_{0})} \right) \right\rbrack \right)} \\ {= {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack d_{(rn + j_{0})} - y_{(rn + j_{0})} \right\rbrack \right)} \\ {= {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack e_{(rn + j_{0})} \right\rbrack \right);} \end{matrix} & {{Equation}\quad(68)} \end{matrix}$ for j₀=0, 1, . . . , r−1, and in order to pick the line which best fits the data, the MSE is utilized as a performance criterion: $\begin{matrix} \begin{matrix} {J = \frac{1}{2N}\sum\limits_{n = 0}^{N - 1}e_{(n)}^{2}} \\ {= \frac{1}{2N}\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack e_{(rn + j_{0})} \right\rbrack \right) \right)^{2}} \\ {= \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}\left( \frac{1}{2\left( \frac{N}{r} \right)} \right)\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack e_{(rn + j_{0})} \right\rbrack \right) \right)^{2}} \\ {= \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( \frac{1}{2\left( \frac{N}{r} \right)} \right)\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}e_{(rn + j_{0})}^{2} \right\rbrack \right)} \\ {= \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack J_{j_{0}} \right\rbrack \right) = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}J_{j_{0}}} \end{matrix} & {{Equation}\quad(69)} \end{matrix}$ where J_(j₀) is the partial MSE applied on the subdivided data.
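
The decomposition of the MSE into an average of partial MSEs can be checked numerically. The following Python sketch assumes that N is divisible by r and that stream j₀ holds the errors e_((rn+j₀)); the helper names `mse` and `partial_mse` and the error values are illustrative assumptions only.

```python
def mse(e):
    """J = (1 / (2 * len(e))) * sum of squared errors."""
    return sum(v * v for v in e) / (2 * len(e))


def partial_mse(e, r):
    """Partial MSE J_j0 of each of the r interleaved error streams."""
    return [mse(e[j0::r]) for j0 in range(r)]


# Illustrative error samples (N = 8), subdivided into r = 4 streams.
e = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -0.5]
r = 4

J = mse(e)                        # overall MSE over all N samples
J_parts = partial_mse(e, r)       # one partial MSE per stream
J_avg = sum(J_parts) / r          # (1/r) * sum of the partial MSEs

print(J, J_avg)
```

Because every stream contains the same number N/r of samples, the average of the partial MSEs coincides with the overall MSE in this sketch.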

The goal is to minimize J analytically, which can be done by taking its partial derivatives with respect to the unknowns and equating the resulting equations to zero as follows: $\begin{matrix} \left\{ \begin{matrix} {\frac{\partial J}{\partial b} = 0} \\ {\frac{\partial J}{\partial w} = 0} \end{matrix} \right.; & {{Equation}\quad(70)} \end{matrix}$ which yields: $\begin{matrix} \left\{ \begin{matrix} {\frac{\partial J}{\partial b} = \frac{\partial\left( \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack J_{j_{0}} \right\rbrack \right) \right)}{\partial b} = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \frac{\partial J_{j_{0}}}{\partial b} \right\rbrack \right) = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}\frac{\partial J_{j_{0}}}{\partial b} = 0} \\ {\frac{\partial J}{\partial w} = \frac{\partial\left( \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack J_{j_{0}} \right\rbrack \right) \right)}{\partial w} = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \frac{\partial J_{j_{0}}}{\partial w} \right\rbrack \right) = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}\frac{\partial J_{j_{0}}}{\partial w} = 0} \end{matrix} \right.; & {{Equation}\quad(71)} \end{matrix}$ $\begin{matrix} \left\{ \begin{matrix} {\frac{\partial J}{\partial b} = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \frac{\partial\left( \left( \frac{1}{2\left( \frac{N}{r} \right)} \right)\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( d_{(rn + j_{0})} - \left( b + wx_{(rn + j_{0})} \right) \right)^{2} \right)}{\partial b} \right\rbrack \right) = 0} \\ {\frac{\partial J}{\partial w} = \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}{\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \frac{\partial\left( \left( \frac{1}{2\left( \frac{N}{r} \right)} \right)\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( d_{(rn + j_{0})} - \left( b + wx_{(rn + j_{0})} \right) \right)^{2} \right)}{\partial w} \right\rbrack \right) = 0} \end{matrix} \right.; & {{Equation}\quad(72)} \end{matrix}$ Therefore, $\begin{matrix} \left\{ \begin{matrix} {- \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( \frac{1}{\left( \frac{N}{r} \right)} \right)\left( d_{(rn + j_{0})} - \left( b + wx_{(rn + j_{0})} \right) \right) = 0} \\ {- \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( \frac{1}{\left( \frac{N}{r} \right)} \right)\left( d_{(rn + j_{0})} - \left( b + wx_{(rn + j_{0})} \right) \right)x_{(rn + j_{0})} = 0} \end{matrix} \right.; & {{Equation}\quad(73)} \end{matrix}$ which yields: $\begin{matrix} \begin{matrix} {\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}d_{(rn + j_{0})} = Nb + w\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})}} \\ {\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}d_{(rn + j_{0})}x_{(rn + j_{0})} = b\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})} + w\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})}^{2}} \end{matrix} & {{Equation}\quad(74)} \end{matrix}$

The set of Equations (74) is called the normal equations. The solution of this set of equations is: $\begin{matrix} {b = \frac{\left( \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( x_{(rn + j_{0})} \right)^{2} \right)\left( \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}d_{(rn + j_{0})} \right) - \left( \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})} \right)\left( \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})}d_{(rn + j_{0})} \right)}{N\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( x_{(rn + j_{0})} - {\bar{x}}_{(rn + j_{0})} \right)^{2}};} & {{Equation}\quad(75)} \end{matrix}$ which, after simplification, yields: $\begin{matrix} {b = \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}\left( \frac{\left( \sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( x_{(rn + j_{0})} \right)^{2} \right)\left( \sum\limits_{n = 0}^{(\frac{N}{r}) - 1}d_{(rn + j_{0})} \right) - \left( \sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})} \right)\left( \sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{(rn + j_{0})}d_{(rn + j_{0})} \right)}{\left( \frac{N}{r} \right)\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( x_{(rn + j_{0})} - {\bar{x}}_{(rn + j_{0})} \right)^{2}} \right) = \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}b_{j_{0}}} & {{Equation}\quad(76)} \end{matrix}$ for j₀=0, 1, . . . , r−1. By analogy, the second unknown is computed to obtain: $\begin{matrix} {w = \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}w_{j_{0}};} & {{Equation}\quad(77)} \end{matrix}$ where b_(j₀) and w_(j₀) are the solutions of the partial least squares applied on the decomposed data.
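
As an illustration of Equations (74) to (77), the following Python sketch solves the normal equations on each interleaved subset and averages the partial solutions. The function names `fit_line` and `fit_line_parallel` are hypothetical, and the data are generated from an exact line d=2x+1 so that every partial fit recovers the same parameters; for noisy data the average of the partial solutions should be treated as an approximation of the full-data solution rather than assumed identical to it.

```python
def fit_line(x, d):
    """Solve the two normal equations for d ~ w*x + b on one data set."""
    n = len(x)
    sx, sd = sum(x), sum(d)
    sxx = sum(v * v for v in x)
    sxd = sum(a * c for a, c in zip(x, d))
    det = n * sxx - sx * sx          # equals n * sum((x - mean(x))**2)
    b = (sxx * sd - sx * sxd) / det
    w = (n * sxd - sx * sd) / det
    return b, w


def fit_line_parallel(x, d, r):
    """Average the r partial least-squares solutions (b_j0, w_j0)."""
    parts = [fit_line(x[j0::r], d[j0::r]) for j0 in range(r)]
    b = sum(p[0] for p in parts) / r
    w = sum(p[1] for p in parts) / r
    return b, w


# Noiseless illustrative data on the line d = 2x + 1 (an assumption).
x = [float(n) for n in range(8)]
d = [2.0 * v + 1.0 for v in x]

b_full, w_full = fit_line(x, d)
b_par, w_par = fit_line_parallel(x, d, 2)
print(b_full, w_full, b_par, w_par)
```

Each call to `fit_line` inside `fit_line_parallel` touches only one subset, so the r partial fits are independent and could run concurrently.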

The parallel implementation of the least squares for multiple variable case.

The MSE of Equation (9), after being expanded in terms of the Jaber product, becomes: $\begin{matrix} \begin{matrix} {J = \frac{1}{2N}\sum\limits_{n = 0}^{N - 1}\left( d_{(n)} - \sum\limits_{k = 0}^{p}w_{(k)}x_{(n,k)} \right)^{2}} \\ {= \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\left( \frac{1}{2\left( \frac{N}{r} \right)} \right)\left( {\hat{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack d_{(rn + j_{0})} - \sum\limits_{k = 0}^{p}w_{(k)}x_{((rn + j_{0}),k)} \right\rbrack \right) \right)^{2}} \\ {= \frac{1}{r}\left( \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}\frac{1}{2\left( \frac{N}{r} \right)}\left\lbrack d_{(rn + j_{0})} - \sum\limits_{k = 0}^{p}w_{(k)}x_{((rn + j_{0}),k)} \right\rbrack^{2} \right)} \\ {= \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}J_{j_{0}}} \end{matrix} & {{Equation}\quad(78)} \end{matrix}$ for j₀=0, 1, . . . , r−1, where J_(j₀) is the partial MSE applied on the subdivided data.

The minimum of this equation can be found in exactly the same way as before, by taking the derivatives of J with respect to the unknowns (w_((k))) and equating the results to zero. This yields a set of p+1 equations in p+1 unknowns called the normal equations: $\begin{matrix} {\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{((rn + j_{0}),j)}d_{(rn + j_{0})} = \sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{k = 0}^{p}w_{(k)}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}x_{((rn + j_{0}),k)}x_{((rn + j_{0}),j)};} & {{Equation}\quad(79)} \end{matrix}$ for j=0, 1, . . . , p.

Parallel Correlation Coefficient.

It is necessary to calculate a measure of how successfully the regression line represents the relationship between x and d. The size of the MSE (J) can be used to determine which line best fits the data, but it does not necessarily reflect whether a line fits the data at all. The size of the MSE is dependent upon the number of data samples and the magnitude (or power) of the data samples. For instance, by simply scaling the data, the MSE can be changed without changing how well the data is fit by the regression line. The correlation coefficient (c) solves this problem by comparing the variance of the predicted value with the variance of the desired value. The value c² represents the amount of variance in the data captured by the linear regression: $\begin{matrix} \begin{matrix} {c^{2} = \frac{\sum\limits_{n}\left( y_{(n)} - \bar{d} \right)^{2}}{\sum\limits_{n}\left( d_{(n)} - \bar{d} \right)^{2}}} \\ {= \frac{\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( y_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}{\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( d_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}} \\ {= \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}\frac{\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( y_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}{\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( d_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}} \\ {= \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}c_{j_{0}}^{2}} \end{matrix} & {{Equation}\quad(80)} \end{matrix}$

If y_((n)) is replaced by the equation of the regression line and the expression is simplified, the correlation coefficient is obtained as follows: $\begin{matrix} \begin{matrix} {c = \frac{\frac{\sum\limits_{n}\left( x_{(n)} - \bar{x} \right)\left( d_{(n)} - \bar{d} \right)}{N}}{\sqrt{\frac{\sum\limits_{n}\left( d_{(n)} - \bar{d} \right)^{2}}{N}}\sqrt{\frac{\sum\limits_{n}\left( x_{(n)} - \bar{x} \right)^{2}}{N}}}} \\ {= \left( \frac{1}{r} \right)\frac{\frac{\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( x_{(rn + j_{0})} - {\bar{x}}_{(rn + j_{0})} \right)\left( d_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right) \right\rbrack \right)}{N}}{\sqrt{\frac{\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( d_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}{N}}\sqrt{\frac{\sum\limits_{j_{0} = 0}^{r - 1}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( x_{(rn + j_{0})} - {\bar{x}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}{N}}}} \\ {= \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}\frac{\frac{\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( x_{(rn + j_{0})} - {\bar{x}}_{(rn + j_{0})} \right)\left( d_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right) \right\rbrack \right)}{N}}{\sqrt{\frac{\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( d_{(rn + j_{0})} - {\bar{d}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}{N}}\sqrt{\frac{\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack \left( x_{(rn + j_{0})} - {\bar{x}}_{(rn + j_{0})} \right)^{2} \right\rbrack \right)}{N}}}} \\ {= \left( \frac{1}{r} \right)\sum\limits_{j_{0} = 0}^{r - 1}c_{j_{0}}} \end{matrix} & {{Equation}\quad(81)} \end{matrix}$

The numerator is the partial covariance of the two variables, and the denominator is the product of the corresponding partial standard deviations. The correlation coefficient is confined to the range [−1, 1]. When c=1, there is a perfect positive correlation between x and d (i.e., they co-vary). When c=−1, there is a perfect negative correlation between x and d (i.e., they vary in opposite ways, so that when x increases, d decreases). When c=0, there is no correlation between x and d (i.e., the variables are called uncorrelated). Intermediate values describe partial correlations.
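
The limiting cases described above can be reproduced with a short Python sketch. The function names `correlation` and `partial_correlations` are hypothetical, the data are illustrative, and the subsets are assumed to be the interleaved streams x_((rn+j₀)), d_((rn+j₀)).

```python
import math


def correlation(x, d):
    """Covariance of x and d divided by the product of their standard deviations."""
    n = len(x)
    mx, md = sum(x) / n, sum(d) / n
    cov = sum((a - mx) * (c - md) for a, c in zip(x, d)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sd = math.sqrt(sum((c - md) ** 2 for c in d) / n)
    return cov / (sx * sd)


def partial_correlations(x, d, r):
    """Correlation coefficient c_j0 of each of the r interleaved subsets."""
    return [correlation(x[j0::r], d[j0::r]) for j0 in range(r)]


x = [float(n) for n in range(8)]
d_up = [2.0 * v + 1.0 for v in x]     # perfectly positively correlated with x
d_down = [-3.0 * v for v in x]        # perfectly negatively correlated with x

c_up = correlation(x, d_up)
c_down = correlation(x, d_down)
c_avg = sum(partial_correlations(x, d_up, 2)) / 2   # (1/r) * sum of c_j0
print(c_up, c_down, c_avg)
```

For these exactly linear data every partial coefficient equals the overall coefficient up to rounding, so their average does as well.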

Parallel Correlation Coefficient for Multiple Variables.

The method of least squares is very powerful. Estimation theory says that the least square estimator is the “best linear unbiased estimator” (BLUE), since it has no bias and minimizes the variance. It can be generalized to higher order polynomial curves such as quadratics, cubics, etc. (the generalized least squares). In this case, nonlinear regression models are obtained and more coefficients need to be computed, but the methodology still applies. Regression can also be extended to multiple variables, where d depends not on a single variable x but on a vector X=[x₁, . . . , x_(p)]^(T), where T means the transpose and vectors are denoted by capital letters. In this case the regression line becomes a hyper-plane in the space x₁, x₂, . . . , x_(p). The autocorrelation of the input samples for indices k, j is defined as follows: $\begin{matrix} \begin{matrix} {R_{(k,j)} = \frac{1}{N}\sum\limits_{n}x_{(n,k)}x_{(n,j)}} \\ {= \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}\frac{1}{\left( \frac{N}{r} \right)}\sum\limits_{n = 0}^{(\frac{N}{r}) - 1}{\overset{\Cap}{*}}_{(r,r,\frac{N}{r})}\left( I_{r},{col}\left\lbrack x_{(rn + j_{0},k)}x_{(rn + j_{0},j)} \right\rbrack \right)} \\ {= \frac{1}{r}\sum\limits_{j_{0} = 0}^{r - 1}R_{j_{0_{(k,j)}}}.} \end{matrix} & {{Equation}\quad(82)} \end{matrix}$

Knowing that the autocorrelation measures similarity across the input samples, when k=j, R is just the sum of the squares of the input samples (the power in the data). When k differs from j, R measures the sum of the cross-products for every possible combination of the indices. Thus, information about the structure of the data set is obtained as follows: $\begin{matrix} \begin{matrix} {P_{(j)} = {\frac{1}{N}{\sum\limits_{n}{x_{({n,j})}d_{(n)}}}}} \\ {= {\frac{1}{r}{\sum\limits_{j_{0} = 0}^{r - 1}{\frac{1}{\left( \frac{N}{r} \right)}{\sum\limits_{n = 0}^{{(\frac{N}{r})} - 1}{x_{({{{rn} + j_{0}},j})}d_{({{rn} + j_{0}})}}}}}}} \\ {= {\frac{1}{r}{\sum\limits_{j_{0} = 0}^{r - 1}{P_{j_{0_{(j)}}}.}}}} \end{matrix} & {{Equation}\quad(83)} \end{matrix}$
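
Because the autocorrelation and the cross-correlation are plain averages of products, they split exactly into averages over the subdivided data when N is divisible by r. The Python sketch below illustrates this; the names `autocorr`, `crosscorr`, `parallel_autocorr` and `parallel_crosscorr` are hypothetical, and each row of `X` is assumed to hold the variables x_((n,0)), . . . , x_((n,p)) of one sample.

```python
def autocorr(X, k, j):
    """R(k, j) = (1/N) * sum over n of x(n, k) * x(n, j)."""
    return sum(row[k] * row[j] for row in X) / len(X)


def crosscorr(X, d, j):
    """P(j) = (1/N) * sum over n of x(n, j) * d(n)."""
    return sum(row[j] * dn for row, dn in zip(X, d)) / len(X)


def parallel_autocorr(X, k, j, r):
    """Average of the r partial autocorrelations over interleaved subsets."""
    return sum(autocorr(X[j0::r], k, j) for j0 in range(r)) / r


def parallel_crosscorr(X, d, j, r):
    """Average of the r partial cross-correlations over interleaved subsets."""
    return sum(crosscorr(X[j0::r], d[j0::r], j) for j0 in range(r)) / r


# Illustrative data: N = 8 samples with two input variables each.
X = [[1.0, 0.5], [2.0, -1.0], [0.0, 3.0], [1.5, 2.0],
     [-1.0, 0.0], [0.5, 0.5], [2.5, -2.0], [3.0, 1.0]]
d = [1.0, 0.0, 2.0, 3.5, -1.0, 1.0, 0.5, 4.0]

print(autocorr(X, 0, 1), parallel_autocorr(X, 0, 1, 4))
print(crosscorr(X, d, 0), parallel_crosscorr(X, d, 0, 4))
```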

Parallel Least Squares as a Search for the Parameters of a Parallel Linear System.

The purpose of least squares is to find parameters (b, w₁, w₂, . . . , w_(p)) that minimize the difference between the system output y_((n)) and the desired response d_((n)). So, regression is effectively computing the optimal parameters of an interpolating system which predicts the value of d from the value of x.

FIG. 10 shows graphically the operation of adapting the parameters of the linear system. The system output y is always a linear combination of the input x with the bias, so it has to lie on a straight line of equation y=wx+b. Changing b modifies the y-intercept, while changing w modifies the slope. The goal of the parallel linear regression is to adjust the position of the set of lines such that the average square difference between the y_((rn+j₀)) values (on the line) and the cloud of points d_((rn+j₀)), i.e., the error e_((rn+j₀)), is minimized. The key point is to recognize that the error contains information which can be used to optimally place the line. FIG. 40 shows this by including a subsystem which accepts the error and modifies the parameters of the system. Thus, the error e_((rn+j₀)) is fed back to the system and indirectly affects the output through a change in the parameters (b_(j₀), w_(j₀)). Effectively, the system is made “aware” of its performance through the error. With the incorporation of a mechanism that automatically modifies the system parameters, a very powerful linear system can be built that will constantly seek optimal parameters. Such systems are called adaptive systems, and are the focus of the present invention.

The apparatus 400 in FIG. 40 comprises a plurality of adaptive filters 402, feedback networks 404, a plurality of subtractors 406 and a Jaber product device 408. Each of the plurality of adaptive filters 402 processes one of the subdivided input data sets, and each of the subtractors 406 subtracts the corresponding filtering output from the corresponding desired result. The resulting error signals are used by the feedback networks 404 for adjusting the coefficients of the adaptive filters 402. The Jaber product device 408 rearranges the filtering outputs to generate a final output.

The r one-dimensional parallel LMS algorithm.

Based on Equation (69), Equation (30) can be formulated as: $\begin{matrix} \begin{matrix} {{\nabla J_{(k)}} = {\nabla\left( {\frac{1}{r}{\sum\limits_{j_{0} = 0}^{r - 1}{{\overset{\Cap}{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack J_{j_{0}{(k)}} \right\rbrack}} \right)}}} \right)}} \\ {= {\frac{\partial}{\partial w}J_{(k)}}} \\ {= {\frac{\partial}{\partial w}\left( {\frac{1}{r}{\sum\limits_{j_{0} = 0}^{r - 1}{{\overset{\Cap}{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack J_{j_{0}{(k)}} \right\rbrack}} \right)}}} \right)}} \\ {\approx {\frac{\partial}{\partial w}\frac{1}{2{Nr}}{\sum e^{2}}}} \\ {= {\frac{\partial}{\partial w}\frac{1}{2{Nr}}{\sum\limits_{j_{0} = 0}^{r - 1}{\sum\limits_{n = 0}^{{(\frac{N}{r})} - 1}\left( {{\overset{\Cap}{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack e_{({{rn} + j_{0}})} \right\rbrack}} \right)} \right)^{2}}}}} \\ {\approx {\frac{1}{r}\frac{\partial}{\partial w}{\left( {{\overset{\Cap}{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack {e_{{({{rn} + j_{0}})}_{(k)}}^{2}x_{{({{rn} + j_{0}})}_{(k)}}} \right\rbrack}} \right)} \right).}}} \end{matrix} & {{Equation}\quad(84)} \end{matrix}$

If the estimator of Equation (30) is substituted in Equation (29), the steepest descent equation becomes: $\begin{matrix} {{{{\overset{\Cap}{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack w_{j_{0_{({k + 1})}}} \right\rbrack}} \right)} = {{\overset{\Cap}{*}}_{({r,r,\frac{N}{r}})}\left( {I_{r},{{col}\left\lbrack \left( {w_{j_{0_{(k)}}} + \left( {\left( \frac{1}{r} \right)\left( {\eta_{j_{0}}e_{{({{rn} + j_{0}})}_{(k)}}x_{{({{rn} + j_{0}})}_{(k)}}} \right)} \right)} \right) \right\rbrack}} \right)}};} & {{Equation}\quad(85)} \end{matrix}$ for j₀=0, 1, . . . , r−1.

This equation is the r parallel LMS algorithm, which is used as a predictive filter, as illustrated in FIG. 36. With the LMS rule, one does not need to worry about perturbation and averaging to properly estimate the gradient during each iteration; it is the iterative process that improves the gradient estimator. The small constant η is called the step size or the learning rate.
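
The update rule of Equation (85) can be sketched in Python for the one-dimensional case, with one scalar weight per stream. The function name `parallel_lms`, the repeated passes over the data (`epochs`), the step size and the data are all illustrative assumptions; the sketch only shows the form of the update w_(j₀) ← w_(j₀) + (1/r)·η·e·x applied independently on each stream.

```python
def parallel_lms(x, d, r, eta, epochs):
    """Run r independent one-weight LMS filters on interleaved streams."""
    w = [0.0] * r
    for _ in range(epochs):
        for j0 in range(r):
            # Stream j0 holds samples x[r*n + j0], d[r*n + j0].
            for xn, dn in zip(x[j0::r], d[j0::r]):
                e = dn - w[j0] * xn              # instantaneous error
                w[j0] += (eta / r) * e * xn      # LMS update with 1/r scaling
    return w


# Illustrative noiseless data d = 3 * x (an assumption, not from the text).
x = [0.1 * (n + 1) for n in range(8)]
d = [3.0 * v for v in x]

weights = parallel_lms(x, d, r=2, eta=0.5, epochs=200)
print(weights)
```

In this sketch the loop over j₀ is written sequentially, but no stream reads or writes another stream's weight, so the r filters could be updated concurrently.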

The function of the predictive adaptive filter is to provide the best prediction (in some sense) of the present value of a random signal. The present value of the signal thus serves the purpose of a desired response for the adaptive filter. Past values of the signal supply the input applied to the adaptive filter. Depending on the application of interest, the adaptive filter output or the estimation (prediction) error may serve as the system output. In the first case, the system operates as a predictor; in the latter case, it operates as a prediction-error filter. This configuration could be used to enhance a sinusoid in broadband noise (FIG. 42).

FIGS. 37-39 are simulation results of the partial LMS algorithm and the overall system. In the simulation, a signal of size 256 (N=256) is predicted. The data is subdivided into odd and even samples, two predictive filters are used in parallel to predict the signal in each set of data, and the final result is obtained by rearranging the data into normal order through the Jaber product device.
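
A setup of this kind can be sketched in Python as follows. This is not the simulation of FIGS. 37-39: the test signal (a noiseless sinusoid), the two-tap filter length, the step size and the function names `lms_predictor` and `parallel_predict` are assumptions chosen for illustration, and the Jaber product device is modelled as a plain re-interleaving of the even and odd predictions.

```python
import math


def lms_predictor(u, taps, eta):
    """Predict u[m] from its previous `taps` samples with an LMS filter."""
    w = [0.0] * taps
    y = [0.0] * len(u)
    for m in range(taps, len(u)):
        past = [u[m - 1 - i] for i in range(taps)]
        y[m] = sum(wi * pi for wi, pi in zip(w, past))
        e = u[m] - y[m]                               # prediction error
        w = [wi + eta * e * pi for wi, pi in zip(w, past)]
    return y


def parallel_predict(signal, r, taps, eta):
    """Predict each interleaved stream separately, then restore normal order."""
    branches = [lms_predictor(signal[j0::r], taps, eta) for j0 in range(r)]
    out = [0.0] * len(signal)
    for j0, y in enumerate(branches):
        out[j0::r] = y        # assumed stand-in for the Jaber product device
    return out


N = 256
signal = [math.sin(math.pi * n / 4) for n in range(N)]
prediction = parallel_predict(signal, r=2, taps=2, eta=0.1)

err = [s - p for s, p in zip(signal, prediction)]
early_mse = sum(v * v for v in err[:64]) / 64
late_mse = sum(v * v for v in err[-64:]) / 64
print(early_mse, late_mse)
```

The even and odd streams of a sinusoid are themselves sinusoids, which is why a short predictor per stream is sufficient in this sketch; the prediction error over the last quarter of the record is much smaller than over the first quarter once both filters have adapted.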

Although the features and elements of the present invention are described in the preferred embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the preferred embodiments or in various combinations with or without other features and elements of the present invention. 

1. An apparatus for performing regression of input data, the apparatus comprising: a plurality of multipliers for multiplying each of a plurality of subdivided input data with corresponding weights, respectively; a Jaber product device for rearranging the multiplication results and generating a regression output; a subtractor for subtracting the regression output from a desired result to generate an error; and, a feedback network for adjusting the weights in accordance with the error.
 2. The apparatus of claim 1 further comprising a multiplier for biasing the regression output.
 3. An apparatus for performing regression of input data, the apparatus comprising: a plurality of multipliers for multiplying each of a plurality of subdivided input data with corresponding weights, respectively; a plurality of subtractors for subtracting the multiplication results from a desired result to generate an error; a Jaber product device for rearranging the errors; and a feedback network for adjusting the weights in accordance with the error.
 4. The apparatus of claim 3 further comprising a plurality of multipliers for biasing the multiplication results.
 5. An apparatus for processing input data in parallel, the apparatus comprising: a plurality of adaptive filters for processing the input data in parallel, each adaptive filter accepting each of a plurality of subdivided input data and generating a filtered output; a plurality of subtractors, each subtractor for subtracting the filtered output from a desired output for generating error signals; a Jaber product device for rearranging the filtered output from the adaptive filters; and a feedback network for adjusting the adaptive filters in accordance with the error signals.
 6. A method for performing regression of input data, the method comprising: receiving input data; dividing the input data into a plurality of subgroups; multiplying each subdivided input data with corresponding weights, respectively, in parallel; rearranging the multiplication results to generate a regression output; subtracting the regression output from a desired result to generate an error; and adjusting the weights in accordance with the error.
 7. A method for processing input data in parallel, the method comprising: receiving input data; dividing the input data into a plurality of subgroups; filtering each of the subdivided input data in parallel with a plurality of adaptive filters; rearranging the filtered results to generate an output; subtracting the output from a desired result to generate an error; and adjusting the adaptive filters in accordance with the error. 